mirror of
https://github.com/wiltodelta/remove-ai-watermarks.git
synced 2026-07-04 23:47:49 +02:00
feat(metadata): blank AI-generator tokens in AVIF/HEIF Exif meta-box items
Closes a documented coverage gap (P2#9): an AI Software/Make/Artist/ImageDescription token in an EXIF item (its TIFF bytes live in mdat/idat) survived remove_ai_metadata because the top-level box stripper and (absent pillow-heif) the PIL EXIF reader can't reach it.

New isobmff.blank_ai_exif_tokens finds EXIF TIFF blocks by their II/MM byte-order header, validates each with piexif (a coincidental II/MM run in pixels won't parse as a TIFF IFD, so it's ignored), and overwrites any AI_GENERATOR_TOKENS-bearing value with same-length spaces -- so box sizes and iloc offsets stay valid and the coded image is untouched (mirrors blank_ai_xmp_packets; no iinf/iloc surgery, no exiftool dep).

Camera/editor EXIF without an AI token is preserved. Wired into remove_ai_metadata's ISOBMFF path. Covers the realistic AI-generator-token case; xAI-signature-in-meta-box-EXIF (Grok is JPEG-only) stays out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
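The same-length blanking idea from the commit message can be sketched with the standard library alone. This is an illustrative sketch, not the commit's implementation: it replaces the piexif validation step with a minimal hand-rolled IFD0 walk, and `AI_TOKENS`, `build_tiff`, `ascii_values`, and `blank_tokens` are hypothetical names (`AI_TOKENS` stands in for the project's `AI_GENERATOR_TOKENS`). It also keeps the trailing NUL of each ASCII value, which is a choice made for this sketch only.

```python
import struct

# Hypothetical stand-in for the project's AI_GENERATOR_TOKENS constant.
AI_TOKENS = ("midjourney", "dall-e")
# TIFF tag ids: ImageDescription, Make, Software, Artist.
ASCII_TAGS = {0x010E, 0x010F, 0x0131, 0x013B}


def build_tiff(tags: dict[int, bytes]) -> bytes:
    """Build a little-endian TIFF block with ASCII-only IFD0 entries (fixture)."""
    entries = sorted(tags.items())
    data_start = 8 + 2 + 12 * len(entries) + 4  # header, count, entries, next-IFD
    ifd, blob = b"", b""
    for tag, text in entries:
        value = text + b"\x00"
        if len(value) <= 4:
            field = value.ljust(4, b"\x00")  # short values are stored inline
        else:
            field = struct.pack("<I", data_start + len(blob))
            blob += value
        ifd += struct.pack("<HHI", tag, 2, len(value)) + field
    return b"II\x2a\x00" + struct.pack("<IH", 8, len(entries)) + ifd + b"\x00" * 4 + blob


def ascii_values(buf: bytes, pos: int) -> list[tuple[int, int]]:
    """Return (start, length) spans of tracked ASCII values in the IFD0 at pos.

    Returns [] when the bytes after the header do not parse as an IFD.
    """
    order = "<" if buf[pos:pos + 2] == b"II" else ">"
    try:
        (ifd_off,) = struct.unpack_from(order + "I", buf, pos + 4)
        (count,) = struct.unpack_from(order + "H", buf, pos + ifd_off)
        spans = []
        for i in range(count):
            entry = pos + ifd_off + 2 + 12 * i
            tag, typ, n = struct.unpack_from(order + "HHI", buf, entry)
            if typ != 2 or tag not in ASCII_TAGS:
                continue
            if n <= 4:
                start = entry + 8  # inline value
            else:
                start = pos + struct.unpack_from(order + "I", buf, entry + 8)[0]
            if start + n > len(buf):
                return []
            spans.append((start, n))
        return spans
    except struct.error:
        return []  # a coincidental II/MM run: not a parseable IFD


def blank_tokens(data: bytes) -> tuple[bytes, int]:
    """Overwrite AI-token ASCII values with spaces of the same length."""
    out = bytearray(data)
    blanked = 0
    for header in (b"II\x2a\x00", b"MM\x00\x2a"):
        pos = data.find(header)
        while pos != -1:
            for start, n in ascii_values(data, pos):
                text = data[start:start + n].decode("latin1").lower()
                if any(token in text for token in AI_TOKENS):
                    # Same-length overwrite; the trailing NUL is left in place.
                    out[start:start + n - 1] = b" " * (n - 1)
                    blanked += 1
            pos = data.find(header, pos + len(header))
    return bytes(out), blanked


# Fake AVIF: ftyp box + mdat box carrying the TIFF block.
tiff = build_tiff({0x0131: b"Midjourney", 0x010F: b"NIKON"})
mdat = struct.pack(">I", 8 + len(tiff)) + b"mdat" + tiff
src = b"\x00\x00\x00\x18ftypavif\x00\x00\x00\x00avifmif1" + mdat
cleaned, count = blank_tokens(src)
```

Because the overwrite never changes the byte count, the `mdat` size field written before the TIFF block remains correct, which is the property the commit relies on to avoid rewriting `iloc` offsets.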
@@ -81,7 +81,7 @@ Compact list. Full measurements, incident history, and oracle-validation runs li
 - Pyright first run is slow (2-3 min) due to ML deps (torch/diffusers/transformers stubs); full-project `uv run pyright` can stall for many minutes — scope it to changed files.
 - A third-party PIL plugin autoload (e.g. an HEIF/AVIF plugin) can raise a non-OSError (`ModuleNotFoundError`), not `UnidentifiedImageError`, when opening a file. Code that opens user-supplied or unknown-format files should `except Exception`, not just `OSError`/`UnidentifiedImageError`.
 - rich was dropped: the CLI + analysis scripts print plain text (`click.echo` / the `scripts/_plain_console.py` shim). `rich` is NOT a dependency — importing it breaks the core+dev CI sync; new scripts must use the shim. No Unicode glyphs / colors / progress bars in CLI output by design.
-- AVIF/HEIF/JPEG-XL metadata detection is a binary scan; C2PA removal in those containers (and MP4/MOV/M4V) is `noai/isobmff.py`; non-ISOBMFF audio/video (WebM/MP3/WAV/FLAC/OGG) strips losslessly via ffmpeg on PATH. Still NOT built: an `Exif` meta-box *item* (needs `iinf`/`iloc` surgery) and Resemble PerTh audio detection (no presence/confidence flag exists).
+- AVIF/HEIF/JPEG-XL metadata detection is a binary scan; C2PA removal in those containers (and MP4/MOV/M4V) is `noai/isobmff.py`; non-ISOBMFF audio/video (WebM/MP3/WAV/FLAC/OGG) strips losslessly via ffmpeg on PATH. An AI-generator token in an `Exif` meta-box *item* (bytes in `mdat`/`idat`) is now blanked **in place** by `isobmff.blank_ai_exif_tokens` (same-length space overwrite, piexif-validated so a coincidental II/MM run in pixels is ignored — no `iinf`/`iloc` surgery, mirrors `blank_ai_xmp_packets`); it scrubs the AI-token value only, leaving camera/editor EXIF intact. Still NOT built: Resemble PerTh audio detection (no presence/confidence flag exists).
 - **SynthID technical reference: `docs/synthid.md`** — primary-source-cited doc covering mechanism (post-hoc encoder/decoder pair, 136-bit payload at 512x512, pixel-space, model weights NOT modified), robustness numbers (arXiv:2510.09263: ~99.98% TPR@0.1%FPR across 30 transforms including JPEG/crop/resize/color/noise), removal attacks and forensic detectability (arXiv:2605.09203: all 6 attacks detectable at >98% TPR@1%FPR), detectability limits (no public decoder, metadata-proxy only), oracle scope, and adoption landscape. Read that doc first before adding notes here.
 - **SynthID detection is metadata-only.** No local pixel detector is possible by design (Google's decoder is proprietary, trusted-testers only); we read the C2PA companion proxy, which goes quiet once metadata is stripped — a quiet proxy is not proof the pixel watermark is gone. The Gemini app "Verify with SynthID" is the ONLY valid SynthID oracle; `openai.com/verify` is scoped to OpenAI provenance and each vendor's oracle detects only its own content. SynthID survives JPEG re-encode, so GitHub issue attachments remain valid pixel-watermark test subjects. Every spectral/phase detection approach evaluated (reverse-SynthID, our own probes) works only on controlled solid fills, never on real content.
 - **External AI-vs-real classifier models are out of scope** (decided 2026-05-24): per-generator, degrade off-distribution, and our own light SDXL pass would likely defeat them. Detection stays local + signal-based.
@@ -41,7 +41,7 @@ Metadata detection for AVIF/HEIF/JPEG-XL relies on a binary scan for `C2PA_UUID`

 **Meta-box XMP now handled (`isobmff.blank_ai_xmp_packets`, v0.6.9):** an AI-label XMP packet stored as a meta-box `mime` item (AVIF/HEIF) is blanked in place (overwritten with spaces of the same length, so `iloc` offsets and the coded image stay valid).

-**Still NOT built:** an `Exif` *item* inside the `meta` box (rare -- AI labels are XMP) needs full `iinf`/`iloc` surgery (offset rewrite) with corruption risk -- exiftool (R/W/C for HEIC/AVIF EXIF+XMP, verified on exiftool.org 2026-05-27) would do it but is a non-installed binary dep, so it stays a documented gap.
+**`Exif` item inside the `meta` box (AVIF/HEIF), now handled in place (2026-06-19):** an AI-generator token in an EXIF item (its TIFF bytes live in `mdat`/`idat`) is blanked by `isobmff.blank_ai_exif_tokens` — it finds EXIF TIFF blocks by their II/MM byte-order header, validates each with **piexif** (a coincidental II/MM run in pixel data won't parse as a TIFF IFD, so it is ignored), and overwrites any `Software`/`Make`/`Artist`/`ImageDescription` value carrying an `AI_GENERATOR_TOKENS` token with spaces of the **same length**. Same-length means every box size and `iloc` offset stays valid and the coded image is untouched — so it avoids the full `iinf`/`iloc` surgery (offset rewrite) that exiftool would need (exiftool is a non-installed binary dep, deliberately not used). It scrubs only the AI-token value; camera/editor EXIF is preserved. Wired into `remove_ai_metadata`'s ISOBMFF path after `blank_ai_xmp_packets`. Limitation: covers the AI-generator-token case (the realistic one); a future xAI-signature-in-meta-box-EXIF (Grok is JPEG-only today) is not separately handled. **Still NOT built:** Resemble PerTh audio detection (no presence/confidence flag exists).

 **Audio watermark DETECTION (Resemble PerTh) was evaluated and NOT built (2026-05-26):** `resemble-perth`'s `PerthImplicitWatermarker.get_watermark()` returns a raw bit-array with **no presence/confidence flag** (clean audio decodes to arbitrary bits too), so reliably distinguishing watermarked-from-clean needs either Resemble's fixed payload or a confidence API -- neither is public, and there's no real Resemble sample to calibrate against. Same wall-class as the SynthID pixel detector: the decode exists, reliable presence-detection does not. (perth's top-level `PerthImplicitWatermarker` is also gated to None unless `librosa` is importable.)

@@ -871,23 +871,32 @@ def remove_ai_metadata(
     # codestream bit-for-bit. MP4/MOV/M4A are ISOBMFF too, so the same top-level
     # uuid/jumb box walker applies. Route by suffix OR by an ``ftyp`` content
     # sniff, so a correctly-shaped container is handled whatever its extension.
-    from remove_ai_watermarks.noai.isobmff import blank_ai_xmp_packets, is_isobmff, strip_c2pa_boxes
+    from remove_ai_watermarks.noai.isobmff import (
+        blank_ai_exif_tokens,
+        blank_ai_xmp_packets,
+        is_isobmff,
+        strip_c2pa_boxes,
+    )

     with open(source_path, "rb") as f:
         head = f.read(12)
     if source_path.suffix.lower() in _ISOBMFF_EXTS or is_isobmff(head):
         data = source_path.read_bytes()
-        # Top-level uuid/jumb boxes (C2PA + AI-label XMP), then AI-label XMP that
-        # lives inside a meta-box ``mime`` item (HEIF/AVIF) -- blanked in place so
-        # box sizes and iloc offsets stay valid and the coded image is untouched.
+        # Top-level uuid/jumb boxes (C2PA + AI-label XMP), then the meta-box items
+        # the top-level stripper can't reach (HEIF/AVIF store them in mdat/idat):
+        # AI-label XMP packets and AI-generator tokens in an Exif item -- both
+        # blanked in place (same length) so box sizes and iloc offsets stay valid
+        # and the coded image is untouched.
         cleaned, stripped = strip_c2pa_boxes(data)
         cleaned, blanked = blank_ai_xmp_packets(cleaned)
+        cleaned, exif_blanked = blank_ai_exif_tokens(cleaned)
         output_path.parent.mkdir(parents=True, exist_ok=True)
         output_path.write_bytes(cleaned)
         logger.info(
-            "Stripped %d AI-provenance box(es), blanked %d meta-box XMP packet(s) → %s",
+            "Stripped %d AI-provenance box(es), blanked %d meta-box XMP packet(s) + %d EXIF token(s) → %s",
             stripped,
             blanked,
+            exif_blanked,
             output_path,
         )
         return output_path

@@ -20,7 +20,7 @@ from __future__ import annotations
 import logging
 import re
 import struct
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Any

 if TYPE_CHECKING:
     from collections.abc import Iterator
@@ -157,10 +157,11 @@ def strip_c2pa_boxes(data: bytes) -> tuple[bytes, int]:
     All other boxes (incl. ``mdat`` / codestream) are emitted verbatim, so pixel
     and audio data is preserved bit-for-bit. Non-ISOBMFF input is returned
     unchanged. Despite the name this also covers MP4/MOV/M4A video and audio
-    (all ISOBMFF). NOTE: this drops only top-level boxes. An AI-label XMP packet
-    stored as an *item inside the ``meta`` box* (typical for AVIF/HEIF) is handled
-    separately by :func:`blank_ai_xmp_packets`; an ``Exif`` meta-box item is still
-    not removed (would need meta-box surgery) and remains a documented limitation.
+    (all ISOBMFF). NOTE: this drops only top-level boxes. AI metadata stored as an
+    *item inside the ``meta`` box* (typical for AVIF/HEIF) is handled separately and
+    in place (same length, no offset rewrite): AI-label XMP by
+    :func:`blank_ai_xmp_packets`, and AI-generator tokens in an ``Exif`` item by
+    :func:`blank_ai_exif_tokens`.
     """
     if not is_isobmff(data):
         return data, 0
@@ -223,3 +224,64 @@ def blank_ai_xmp_packets(data: bytes) -> tuple[bytes, int]:
         return packet

     return _XMP_PACKET_RE.sub(_scrub, data), blanked
+
+
+# EXIF TIFF byte-order headers: little-endian (II 0x2a 0x00) and big-endian
+# (MM 0x00 0x2a). A HEIF/AVIF ``Exif`` meta-box item stores its TIFF block in
+# ``mdat`` / ``idat``, so the block (and these headers) appear in the raw bytes.
+_TIFF_HEADERS: tuple[bytes, ...] = (b"II\x2a\x00", b"MM\x00\x2a")
+# How far past a TIFF header an EXIF block plausibly extends; bounds the slice we
+# hand to piexif and search within (EXIF blocks are small kilobyte-scale).
+_EXIF_WINDOW = 256 * 1024
+
+
+def blank_ai_exif_tokens(data: bytes) -> tuple[bytes, int]:
+    """Overwrite (with spaces, in place) any AI-generator token in an EXIF block
+    stored as an ISOBMFF ``meta``-box ``Exif`` item; return ``(data, blanked_count)``.
+
+    HEIF/AVIF can carry EXIF as a ``meta``-box ``Exif`` item whose TIFF bytes live
+    in ``mdat`` / ``idat`` -- out of reach of the top-level box stripper, and (when
+    no pillow-heif plugin is installed) of the PIL EXIF reader too, so an AI
+    ``Software`` / ``Make`` / ``Artist`` / ``ImageDescription`` tag there survived
+    ``remove_ai_metadata`` (a documented gap). This locates EXIF TIFF blocks by
+    their byte-order header, **validates each with piexif** (so a coincidental
+    II/MM run in pixel data is ignored -- it will not parse as a TIFF IFD), and
+    overwrites any value carrying an ``AI_GENERATOR_TOKENS`` token with spaces of
+    the SAME length. Because the replacement is same-length, every box size and
+    ``iloc`` offset stays valid and the coded image is untouched -- only the AI tag
+    content is destroyed; camera/editor EXIF without an AI token is left intact
+    (mirrors ``metadata._scrub_ai_exif`` and ``blank_ai_xmp_packets``).
+    """
+    import piexif
+
+    from remove_ai_watermarks.noai.constants import AI_GENERATOR_TOKENS
+
+    ai_tags = (
+        piexif.ImageIFD.Software,
+        piexif.ImageIFD.Make,
+        piexif.ImageIFD.Artist,
+        piexif.ImageIFD.ImageDescription,
+    )
+    out = bytearray(data)
+    blanked = 0
+    for header in _TIFF_HEADERS:
+        pos = data.find(header)
+        while pos != -1:
+            window = bytes(out[pos : pos + _EXIF_WINDOW])
+            ifd: dict[int, Any] = {}
+            try:
+                ifd = piexif.load(window).get("0th", {})
+            except Exception:
+                ifd = {}
+            for tag in ai_tags:
+                value = ifd.get(tag)
+                if not isinstance(value, bytes):
+                    continue
+                if any(token in value.decode("latin1", "replace").lower() for token in AI_GENERATOR_TOKENS):
+                    # Blank the value bytes in place, within this EXIF block only.
+                    vpos = out.find(value, pos, pos + _EXIF_WINDOW)
+                    if vpos != -1:
+                        out[vpos : vpos + len(value)] = b" " * len(value)
+                        blanked += 1
+            pos = data.find(header, pos + len(header))
+    return bytes(out), blanked

@@ -3,6 +3,7 @@
 from __future__ import annotations

 import shutil
+import struct
 import subprocess
 from pathlib import Path

@@ -125,6 +126,24 @@ class TestHasAiMetadata:

         assert not has_ai_metadata(out)

+    def test_remove_ai_metadata_blanks_exif_token_item_in_avif(self, tmp_path: Path):
+        """End-to-end: ``remove_ai_metadata`` blanks an AI-generator EXIF token
+        stored as a meta-box Exif item (bytes in mdat) without re-encoding."""
+        from remove_ai_watermarks.metadata import remove_ai_metadata
+
+        ftyp = b"\x00\x00\x00\x18ftypavif\x00\x00\x00\x00avifmif1"
+        blob = piexif.dump({"0th": {piexif.ImageIFD.Software: b"Midjourney", piexif.ImageIFD.Make: b"NIKON"}})
+        mdat = struct.pack(">I", 8 + len(blob)) + b"mdat" + blob
+        src = tmp_path / "in.avif"
+        src.write_bytes(ftyp + mdat)
+
+        out = tmp_path / "out.avif"
+        remove_ai_metadata(src, out)
+        cleaned = out.read_bytes()
+        assert len(cleaned) == len(ftyp + mdat)  # in place, no re-encode
+        assert b"Midjourney" not in cleaned  # AI token gone
+        assert b"NIKON" in cleaned  # camera tag preserved
+
     def test_detects_iptc_trained_algorithmic_media_marker(self, tmp_path: Path):
         """Some pipelines embed only the IPTC AI marker in XMP, no C2PA manifest."""
         path = tmp_path / "fake.jpg"

@@ -36,6 +36,7 @@ from remove_ai_watermarks.noai.extractor import (
     has_ai_metadata,
 )
 from remove_ai_watermarks.noai.isobmff import (
+    blank_ai_exif_tokens,
     is_isobmff,
     strip_c2pa_boxes,
 )
@@ -366,6 +367,44 @@ class TestISOBMFF:
         assert stripped == 0
         assert cleaned == data

+    @staticmethod
+    def _avif_with_exif(exif_0th: dict) -> bytes:
+        """A fake AVIF (ftyp + mdat) whose mdat carries an EXIF TIFF block, as a
+        HEIF/AVIF ``Exif`` meta-box item stores it (bytes in mdat)."""
+        import piexif
+
+        blob = piexif.dump({"0th": exif_0th})
+        mdat = struct.pack(">I", 8 + len(blob)) + b"mdat" + blob
+        return FTYP + mdat
+
+    def test_blank_ai_token_in_exif_item(self):
+        import piexif
+
+        data = self._avif_with_exif({piexif.ImageIFD.Software: b"DALL-E", piexif.ImageIFD.Make: b"Canon"})
+        out, blanked = blank_ai_exif_tokens(data)
+        assert blanked == 1
+        assert len(out) == len(data)  # same length -> box sizes / iloc stay valid
+        assert b"DALL-E" not in out  # AI token destroyed
+        assert b"Canon" in out  # camera tag preserved
+        # The TIFF structure still parses, with the AI value blanked and Make kept.
+        blob = out[out.index(b"Exif\x00\x00") + 6 :]
+        ifd = piexif.load(blob)["0th"]
+        assert ifd[piexif.ImageIFD.Software].strip() == b""
+        assert ifd[piexif.ImageIFD.Make] == b"Canon"
+
+    def test_blank_leaves_clean_exif_untouched(self):
+        import piexif
+
+        data = self._avif_with_exif({piexif.ImageIFD.Software: b"Adobe Photoshop", piexif.ImageIFD.Make: b"NIKON"})
+        out, blanked = blank_ai_exif_tokens(data)
+        assert blanked == 0
+        assert out == data  # no AI token -> byte-for-byte unchanged
+
+    def test_blank_no_exif_is_noop(self):
+        out, blanked = blank_ai_exif_tokens(FTYP + b"\x00\x00\x00\x0cmdat" + b"pixels!!")
+        assert blanked == 0
+        assert out == FTYP + b"\x00\x00\x00\x0cmdat" + b"pixels!!"
+

 class TestC2PAInvalidSignature:
     """A .png file that is not actually PNG-signed must read as clean, not crash."""