diff --git a/CLAUDE.md b/CLAUDE.md index 9451852..5b1437b 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -28,7 +28,7 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r - `noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor, and `soft_binding`/`soft_binding_vendors` when a `c2pa.soft-binding` `alg` names a forensic-watermark vendor (`soft_binding_vendors_in(buffer)` is the shared byte-scan, used by both the PNG parser and the non-PNG binary path). - `noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, C2PA_ISSUERS, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new issuer/binding here, not inline. - `metadata.py` — `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout. Both `get_ai_metadata` and `has_ai_metadata` guard the PIL open with `except Exception` (HEIC/unknown formats raise non-OSError) and fall through to the binary scan. `xai_signature(path)` detects xAI/Grok's EXIF-only scheme (`ImageDescription` = `Signature: ` + UUID `Artist`); it feeds `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify`. 
`iptc_ai_system(path)` detects the IPTC Photo Metadata 2025.1 AI-disclosure XMP properties (`IPTC_AI_FIELD_MARKERS` = `AISystemUsed`/`AISystemVersionUsed`/`AIPromptInformation`/`AIPromptWriterName`) and returns the `AISystemUsed` generator name (or `"fields present"`). `remove_ai_metadata` routes **ISOBMFF video** (`.mp4`/`.mov`/`.m4v`) through the same `isobmff.strip_c2pa_boxes` as AVIF/HEIF (MP4 is ISOBMFF), and `_scrub_ai_exif` removes the xAI signature + AI-generator EXIF tags on JPEG output. -- `identify.py` — `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, visible Gemini sparkle, open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). Visible-sparkle is promoted only at confidence ≥ `_SPARKLE_THRESHOLD` (0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49). The cv2 dependency lives in `gemini_engine.detect_sparkle_confidence`, not here. Add platform mappings to `_ISSUER_PLATFORM`, not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged. 
**Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read. +- `identify.py` — `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, visible Gemini sparkle, open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). Visible-sparkle is promoted only at confidence ≥ `_SPARKLE_THRESHOLD` (0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49). The cv2 dependency lives in `gemini_engine.detect_sparkle_confidence`, not here. **C2PA platform attribution is claim-generator-first, issuer-scan fallback** (`_CLAIM_GENERATOR_PLATFORM` then `_ISSUER_PLATFORM`): the claim generator (via the now-public `c2pa.cbor_text_after(head, b"claim_generator")` for non-PNG) names what *produced* the asset, while the issuer byte-scan matches incidental chain mentions. **Verified on real signed files 2026-05-26:** the old issuer-only scan mis-attributed Leica→"Truepic" (timestamp authority), Nikon→"Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), and Truepic→"Google"; claim-generator-first fixes all three. `_CLAIM_GENERATOR_PLATFORM` only lists device tokens **verified against a real C2PA file** (Leica `lc_c2pa`, Nikon, Truepic Lens) -- Pixel/Samsung/Sony/Canon/Bria are documented but NOT added until a real signed sample confirms the in-manifest string (same fixture discipline as Grok/Doubao/Ideogram). Camera C2PA marks capture authenticity, not AI, so these never set `is_ai` (that stays driven by digital-source-type). Add platform mappings to `_CLAIM_GENERATOR_PLATFORM` / `_ISSUER_PLATFORM`, not inline. 
For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged. **Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read. - `gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU). `detect_sparkle_confidence(path)` is the file-level entry point used by `identify.py`. - `invisible_watermark.py` — `detect_invisible_watermark(path)` decodes the OPEN DWT-DCT watermarks (public decoder, no key) embedded by Stable Diffusion / SDXL / FLUX via the `imwatermark` library. Known fixed patterns (verified against upstream source) live in `_BITS_48` (SDXL 48-bit, FLUX.2 48-bit) and `_SD1_STRING` ("StableDiffusionV1", SD 1.x/2.x). Optional dep (extra `detect`); returns None when absent. The `detect` extra pulls **torch** transitively (invisible-watermark declares torch a hard dep, and `WatermarkDecoder` eagerly imports `rivaGan` -> `torch` at import time), so detection needs torch present even though dwtDct runs CPU-only on cv2/numpy/pywavelets — no GPU and no separate `gpu` extra required. **Unlike SynthID this is locally detectable**, but the watermark is fragile (does not survive JPEG re-encode/resize — verified gone after JPEG q90), so it confirms origin only on pristine files. Add new known patterns here. The file carries a top-of-module pyright pragma because imwatermark/cv2 ship no type stubs. 
 - `trustmark_detector.py` — `detect_trustmark(path)` decodes the OPEN, keyless **Adobe TrustMark** watermark (the soft binding behind Adobe Durable Content Credentials, `alg` `com.adobe.trustmark.P`) via the optional `trustmark` package (extra `trustmark`; pulls torch, downloads model weights on first use). Mirrors `invisible_watermark.py` (lazy singleton, top-of-module pyright pragma, returns None when absent). It detects *provenance*, not AI origin as such (TrustMark also marks human-authored content), so `identify` lists it as a watermark without setting `is_ai_generated`. Other soft-binding vendors (Digimarc/Imatag/Steg.AI/...) have no public decoder — they are only *named* via the `C2PA_SOFT_BINDINGS` scan, not decoded.
diff --git a/pyproject.toml b/pyproject.toml
index cc68451..97ba4dc 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "remove-ai-watermarks"
-version = "0.6.0"
+version = "0.6.1"
 description = "Remove visible and invisible AI watermarks from images (Gemini / Nano Banana, ChatGPT, Stable Diffusion)"
 readme = "README.md"
 requires-python = ">=3.10"
diff --git a/src/remove_ai_watermarks/__init__.py b/src/remove_ai_watermarks/__init__.py
index 4162aa7..a5b24f4 100644
--- a/src/remove_ai_watermarks/__init__.py
+++ b/src/remove_ai_watermarks/__init__.py
@@ -1,3 +1,3 @@
 """Remove-AI-Watermarks: Unified tool for removing visible and invisible AI watermarks."""
 
-__version__ = "0.6.0"
+__version__ = "0.6.1"
diff --git a/src/remove_ai_watermarks/identify.py b/src/remove_ai_watermarks/identify.py
index 79778c2..c31a4f4 100644
--- a/src/remove_ai_watermarks/identify.py
+++ b/src/remove_ai_watermarks/identify.py
@@ -34,7 +34,7 @@ from remove_ai_watermarks.metadata import (
     iptc_ai_system,
     xai_signature,
 )
-from remove_ai_watermarks.noai.c2pa import extract_c2pa_info, soft_binding_vendors_in
+from remove_ai_watermarks.noai.c2pa import cbor_text_after, extract_c2pa_info, soft_binding_vendors_in
 from remove_ai_watermarks.noai.constants import C2PA_AI_TOOLS, C2PA_ISSUERS
 
 if TYPE_CHECKING:
@@ -127,6 +127,35 @@ def _ai_tools_in(data: bytes) -> list[str]:
     return sorted({name for sig, name in C2PA_AI_TOOLS.items() if sig in data})
 
 
+# C2PA claim-generator substring -> platform. The claim generator names what
+# PRODUCED the asset, so it is far more reliable than byte-scanning the manifest
+# for an issuer name (which also matches incidental mentions: a timestamp
+# authority like "Truepic" in a Leica chain, an XMP-toolkit "Adobe" string in a
+# Nikon file, or "Google" in a URL -- all verified on real samples). Ordered:
+# camera tokens first so a device wins over an incidental tool name (Nikon's
+# real sample claim generator also contains "Adobe_MAX"). Camera C2PA marks
+# CAPTURE authenticity, not AI, so these never assert is_ai on their own (the
+# verdict still comes from the digital-source-type). Only tokens verified
+# against a real signed file are listed; add more as samples are captured.
+_CLAIM_GENERATOR_PLATFORM: tuple[tuple[str, str], ...] = (
+    ("lc_c2pa", "Leica (camera, C2PA capture)"),
+    ("leica", "Leica (camera, C2PA capture)"),
+    ("nikon", "Nikon (camera, C2PA capture)"),
+    ("truepic", "Truepic Lens (verified capture)"),
+)
+
+
+def _platform_from_generator(generator: str | None) -> str | None:
+    """Map a C2PA claim-generator string to a device/platform, or None."""
+    if not generator:
+        return None
+    low = generator.lower()
+    for token, platform in _CLAIM_GENERATOR_PLATFORM:
+        if token in low:
+            return platform
+    return None
+
+
 def _attribute_platform(issuers: list[str]) -> str | None:
     """Map a set of C2PA issuer names to a human-readable generating platform."""
     joined = " ".join(issuers)
@@ -205,12 +234,20 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
     # ── C2PA Content Credentials ────────────────────────────────────
     has_c2pa = bool(info) or b"c2pa" in head.lower() or C2PA_UUID in head
     issuers = [info["issuer"]] if info.get("issuer") else _issuers_in(head)
-    platform = _attribute_platform(issuers) if has_c2pa else None
     c2pa_is_ai = "trainedAlgorithmicMedia" in info.get("source_type", "") or any(
         m in head for m in (b"trainedAlgorithmicMedia", b"compositeWithTrainedAlgorithmicMedia")
     )
-    # Generator: structured for PNG, binary-scanned for other containers.
-    generator = info.get("claim_generator") or (", ".join(tools) if (tools := _ai_tools_in(head)) else None)
+    # Generator: structured for PNG, CBOR-scanned for other containers. The claim
+    # generator is the authoritative "what produced this", so it drives platform
+    # attribution; the issuer byte-scan is only the fallback (it matches
+    # incidental chain/namespace mentions -- on real samples Leica mis-read as
+    # Truepic, Nikon as Adobe, Truepic as Google until claim-generator took over).
+    generator = (
+        info.get("claim_generator")
+        or cbor_text_after(head, b"claim_generator")
+        or (", ".join(tools) if (tools := _ai_tools_in(head)) else None)
+    )
+    platform = (_platform_from_generator(generator) or _attribute_platform(issuers)) if has_c2pa else None
     if has_c2pa:
         detail = ", ".join(filter(None, [", ".join(issuers), generator, info.get("source_type")]))
         signals.append(Signal("c2pa", detail or "C2PA manifest present", "high"))
diff --git a/src/remove_ai_watermarks/noai/c2pa.py b/src/remove_ai_watermarks/noai/c2pa.py
index 2045a1e..a0fa632 100644
--- a/src/remove_ai_watermarks/noai/c2pa.py
+++ b/src/remove_ai_watermarks/noai/c2pa.py
@@ -131,7 +131,7 @@ def extract_c2pa_info(image_path: Path) -> dict[str, Any]:
     return c2pa_info
 
 
-def _cbor_text_after(payload: bytes, key: bytes) -> str | None:
+def cbor_text_after(payload: bytes, key: bytes) -> str | None:
     """Return the CBOR text-string immediately following ``key`` in ``payload``.
 
     Handles CBOR major-type 3 length prefixes: direct (0x60-0x77), 1-byte
@@ -212,9 +212,9 @@ def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
     # Guard with isprintable(): on some manifests (e.g. Microsoft Designer) the
     # first ``name`` key precedes a binary field (a hash), not the generator
     # string, which would otherwise surface as control-char garbage.
-    if (generator := _cbor_text_after(chunk_data, b"name")) and generator.isprintable():
+    if (generator := cbor_text_after(chunk_data, b"name")) and generator.isprintable():
         c2pa_info["claim_generator"] = generator
-    if (spec := _cbor_text_after(chunk_data, b"specVersion")) and spec.isprintable():
+    if (spec := cbor_text_after(chunk_data, b"specVersion")) and spec.isprintable():
         c2pa_info["c2pa_spec"] = spec
 
     # Find actions
diff --git a/tests/test_identify.py b/tests/test_identify.py
index 59fecc4..159c7b3 100644
--- a/tests/test_identify.py
+++ b/tests/test_identify.py
@@ -337,6 +337,25 @@ class TestIdentifyIptcAi:
         assert "Gemini" in r.platform
 
 
+class TestIdentifyC2paClaimGenerator:
+    """C2PA attribution prefers claim_generator over incidental issuer tokens."""
+
+    def test_claim_generator_beats_incidental_tokens(self, tmp_path: Path):
+        # Real manifests mention timestamp authorities / XMP toolkits, so
+        # "Adobe"/"Google"/"Truepic" appear incidentally; the claim generator
+        # (a Leica camera) must win the platform attribution. Regression guard
+        # for the real-sample mis-attribution (Leica->Truepic, Nikon->Adobe).
+        gen = b"M11-P/2.0.1 lc_c2pa"
+        blob = (
+            b"\xff\xd8\xff\xe1 c2pa.claim jumbf Adobe Google Truepic "
+            b"claim_generator" + bytes([0x60 + len(gen)]) + gen + b" \xff\xd9"
+        )
+        p = tmp_path / "leica_like.jpg"
+        p.write_bytes(blob)
+        r = identify(p, check_visible=False, check_invisible=False)
+        assert r.platform == "Leica (camera, C2PA capture)"
+
+
 # ── Open invisible watermark (SD/SDXL/FLUX) integration ─────────────
 
 from remove_ai_watermarks.invisible_watermark import is_available as _wm_available  # noqa: E402
diff --git a/tests/test_noai.py b/tests/test_noai.py
index a1a804f..1435ce9 100644
--- a/tests/test_noai.py
+++ b/tests/test_noai.py
@@ -8,8 +8,8 @@ from pathlib import Path
 import pytest
 
 from remove_ai_watermarks.noai.c2pa import (
-    _cbor_text_after,
     _parse_c2pa_chunk,
+    cbor_text_after,
     extract_c2pa_chunk,
     extract_c2pa_info,
     has_c2pa_metadata,
@@ -193,36 +193,36 @@ class TestC2PAInjectValidation:
 
 
 class TestCborTextAfter:
-    """_cbor_text_after handles the three CBOR text-string length prefixes."""
+    """cbor_text_after handles the three CBOR text-string length prefixes."""
 
     def test_direct_length(self):
         # major-type 3, direct length (0x60 + len). "abc" -> 0x63.
         payload = b"name" + bytes([0x63]) + b"abc"
-        assert _cbor_text_after(payload, b"name") == "abc"
+        assert cbor_text_after(payload, b"name") == "abc"
 
     def test_one_byte_length(self):
         s = b"x" * 30
         payload = b"name" + bytes([0x78, 30]) + s
-        assert _cbor_text_after(payload, b"name") == "x" * 30
+        assert cbor_text_after(payload, b"name") == "x" * 30
 
     def test_two_byte_length(self):
         s = b"y" * 300
         payload = b"name" + bytes([0x79]) + struct.pack(">H", 300) + s
-        assert _cbor_text_after(payload, b"name") == "y" * 300
+        assert cbor_text_after(payload, b"name") == "y" * 300
 
     def test_key_not_found_returns_none(self):
-        assert _cbor_text_after(b"nothing here", b"name") is None
+        assert cbor_text_after(b"nothing here", b"name") is None
 
     def test_key_at_end_returns_none(self):
-        assert _cbor_text_after(b"prefixname", b"name") is None
+        assert cbor_text_after(b"prefixname", b"name") is None
 
     def test_invalid_head_returns_none(self):
         # 0x00 is not a text-string head.
-        assert _cbor_text_after(b"name" + bytes([0x00]) + b"abc", b"name") is None
+        assert cbor_text_after(b"name" + bytes([0x00]) + b"abc", b"name") is None
 
     def test_latin1_fallback_on_invalid_utf8(self):
         payload = b"name" + bytes([0x61]) + b"\xff"  # len 1, invalid utf-8
-        assert _cbor_text_after(payload, b"name") is not None
+        assert cbor_text_after(payload, b"name") is not None
 
 
 class TestSynthIDVerdict:
diff --git a/uv.lock b/uv.lock
index 179d5d4..2873053 100644
--- a/uv.lock
+++ b/uv.lock
@@ -2784,7 +2784,7 @@ wheels = [
 
 [[package]]
 name = "remove-ai-watermarks"
-version = "0.6.0"
+version = "0.6.1"
 source = { editable = "." }
 dependencies = [
     { name = "click" },