fix(identify): attribute C2PA by claim_generator, not incidental issuer tokens (v0.6.1)

Verified on real signed files that the issuer byte-scan mis-attributes multi-entity manifests: Leica read as "Truepic" (timestamp authority in the chain), Nikon as "Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), Truepic as "Google". Platform attribution now prefers the claim generator (what produced the asset) and falls back to the issuer scan.

- New _CLAIM_GENERATOR_PLATFORM map + _platform_from_generator; claim generator read for non-PNG via the now-public c2pa.cbor_text_after.
- Device tokens listed only where verified against a real C2PA file (Leica lc_c2pa, Nikon, Truepic Lens); Pixel/Samsung/Sony/Canon/Bria deferred until a real sample confirms the in-manifest string. Camera C2PA marks capture authenticity, so these never set is_ai.
- cbor_text_after made public (was _cbor_text_after); call sites + tests updated.
- Regression test: claim_generator beats incidental Adobe/Google/Truepic tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-07-23 08:00:51 +02:00 · 2026-05-26 20:10:07 -07:00
parent 2676325184
commit dda2ee7fbb
8 changed files with 76 additions and 20 deletions
@@ -28,7 +28,7 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r
 - `noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor, and `soft_binding`/`soft_binding_vendors` when a `c2pa.soft-binding` `alg` names a forensic-watermark vendor (`soft_binding_vendors_in(buffer)` is the shared byte-scan, used by both the PNG parser and the non-PNG binary path).
 - `noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, C2PA_ISSUERS, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new issuer/binding here, not inline.
 - `metadata.py` — `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout. Both `get_ai_metadata` and `has_ai_metadata` guard the PIL open with `except Exception` (HEIC/unknown formats raise non-OSError) and fall through to the binary scan. `xai_signature(path)` detects xAI/Grok's EXIF-only scheme (`ImageDescription` = `Signature: <base64>` + UUID `Artist`); it feeds `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify`. `iptc_ai_system(path)` detects the IPTC Photo Metadata 2025.1 AI-disclosure XMP properties (`IPTC_AI_FIELD_MARKERS` = `AISystemUsed`/`AISystemVersionUsed`/`AIPromptInformation`/`AIPromptWriterName`) and returns the `AISystemUsed` generator name (or `"fields present"`). `remove_ai_metadata` routes **ISOBMFF video** (`.mp4`/`.mov`/`.m4v`) through the same `isobmff.strip_c2pa_boxes` as AVIF/HEIF (MP4 is ISOBMFF), and `_scrub_ai_exif` removes the xAI signature + AI-generator EXIF tags on JPEG output.
- `identify.py` — `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, visible Gemini sparkle, open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). Visible-sparkle is promoted only at confidence ≥ `_SPARKLE_THRESHOLD` (0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49). The cv2 dependency lives in `gemini_engine.detect_sparkle_confidence`, not here. Add platform mappings to `_ISSUER_PLATFORM`, not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged. **Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read.
+- `identify.py` — `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, visible Gemini sparkle, open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). Visible-sparkle is promoted only at confidence ≥ `_SPARKLE_THRESHOLD` (0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49). The cv2 dependency lives in `gemini_engine.detect_sparkle_confidence`, not here. **C2PA platform attribution is claim-generator-first, issuer-scan fallback** (`_CLAIM_GENERATOR_PLATFORM` then `_ISSUER_PLATFORM`): the claim generator (via the now-public `c2pa.cbor_text_after(head, b"claim_generator")` for non-PNG) names what *produced* the asset, while the issuer byte-scan matches incidental chain mentions. **Verified on real signed files 2026-05-26:** the old issuer-only scan mis-attributed Leica→"Truepic" (timestamp authority), Nikon→"Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), and Truepic→"Google"; claim-generator-first fixes all three. `_CLAIM_GENERATOR_PLATFORM` only lists device tokens **verified against a real C2PA file** (Leica `lc_c2pa`, Nikon, Truepic Lens) -- Pixel/Samsung/Sony/Canon/Bria are documented but NOT added until a real signed sample confirms the in-manifest string (same fixture discipline as Grok/Doubao/Ideogram). Camera C2PA marks capture authenticity, not AI, so these never set `is_ai` (that stays driven by digital-source-type). Add platform mappings to `_CLAIM_GENERATOR_PLATFORM` / `_ISSUER_PLATFORM`, not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged. **Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read.
 - `gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU). `detect_sparkle_confidence(path)` is the file-level entry point used by `identify.py`.
 - `invisible_watermark.py` — `detect_invisible_watermark(path)` decodes the OPEN DWT-DCT watermarks (public decoder, no key) embedded by Stable Diffusion / SDXL / FLUX via the `imwatermark` library. Known fixed patterns (verified against upstream source) live in `_BITS_48` (SDXL 48-bit, FLUX.2 48-bit) and `_SD1_STRING` ("StableDiffusionV1", SD 1.x/2.x). Optional dep (extra `detect`); returns None when absent. The `detect` extra pulls **torch** transitively (invisible-watermark declares torch a hard dep, and `WatermarkDecoder` eagerly imports `rivaGan` -> `torch` at import time), so detection needs torch present even though dwtDct runs CPU-only on cv2/numpy/pywavelets — no GPU and no separate `gpu` extra required. **Unlike SynthID this is locally detectable**, but the watermark is fragile (does not survive JPEG re-encode/resize — verified gone after JPEG q90), so it confirms origin only on pristine files. Add new known patterns here. The file carries a top-of-module pyright pragma because imwatermark/cv2 ship no type stubs.
 - `trustmark_detector.py` — `detect_trustmark(path)` decodes the OPEN, keyless **Adobe TrustMark** watermark (the soft binding behind Adobe Durable Content Credentials, `alg` `com.adobe.trustmark.P`) via the optional `trustmark` package (extra `trustmark`; pulls torch, downloads model weights on first use). Mirrors `invisible_watermark.py` (lazy singleton, top-of-module pyright pragma, returns None when absent). It detects *provenance*, not AI origin as such (TrustMark also marks human-authored content), so `identify` lists it as a watermark without setting `is_ai_generated`. Other soft-binding vendors (Digimarc/Imatag/Steg.AI/...) have no public decoder — they are only *named* via the `C2PA_SOFT_BINDINGS` scan, not decoded.
@@ -1,6 +1,6 @@
 [project]
 name = "remove-ai-watermarks"
-version = "0.6.0"
+version = "0.6.1"
 description = "Remove visible and invisible AI watermarks from images (Gemini / Nano Banana, ChatGPT, Stable Diffusion)"
 readme = "README.md"
 requires-python = ">=3.10"
@@ -1,3 +1,3 @@
 """Remove-AI-Watermarks: Unified tool for removing visible and invisible AI watermarks."""

-__version__ = "0.6.0"
+__version__ = "0.6.1"
@@ -34,7 +34,7 @@ from remove_ai_watermarks.metadata import (
    iptc_ai_system,
    xai_signature,
 )
-from remove_ai_watermarks.noai.c2pa import extract_c2pa_info, soft_binding_vendors_in
+from remove_ai_watermarks.noai.c2pa import cbor_text_after, extract_c2pa_info, soft_binding_vendors_in
 from remove_ai_watermarks.noai.constants import C2PA_AI_TOOLS, C2PA_ISSUERS

 if TYPE_CHECKING:
@@ -127,6 +127,35 @@ def _ai_tools_in(data: bytes) -> list[str]:
    return sorted({name for sig, name in C2PA_AI_TOOLS.items() if sig in data})


+# C2PA claim-generator substring -> platform. The claim generator names what
+# PRODUCED the asset, so it is far more reliable than byte-scanning the manifest
+# for an issuer name (which also matches incidental mentions: a timestamp
+# authority like "Truepic" in a Leica chain, an XMP-toolkit "Adobe" string in a
+# Nikon file, or "Google" in a URL -- all verified on real samples). Ordered:
+# camera tokens first so a device wins over an incidental tool name (Nikon's
+# real sample claim generator also contains "Adobe_MAX"). Camera C2PA marks
+# CAPTURE authenticity, not AI, so these never assert is_ai on their own (the
+# verdict still comes from the digital-source-type). Only tokens verified
+# against a real signed file are listed; add more as samples are captured.
+_CLAIM_GENERATOR_PLATFORM: tuple[tuple[str, str], ...] = (
+    ("lc_c2pa", "Leica (camera, C2PA capture)"),
+    ("leica", "Leica (camera, C2PA capture)"),
+    ("nikon", "Nikon (camera, C2PA capture)"),
+    ("truepic", "Truepic Lens (verified capture)"),
+)
+
+
+def _platform_from_generator(generator: str | None) -> str | None:
+    """Map a C2PA claim-generator string to a device/platform, or None."""
+    if not generator:
+        return None
+    low = generator.lower()
+    for token, platform in _CLAIM_GENERATOR_PLATFORM:
+        if token in low:
+            return platform
+    return None
+
+
 def _attribute_platform(issuers: list[str]) -> str | None:
    """Map a set of C2PA issuer names to a human-readable generating platform."""
    joined = " ".join(issuers)
@@ -205,12 +234,20 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
    # ── C2PA Content Credentials ────────────────────────────────────
    has_c2pa = bool(info) or b"c2pa" in head.lower() or C2PA_UUID in head
    issuers = [info["issuer"]] if info.get("issuer") else _issuers_in(head)
-    platform = _attribute_platform(issuers) if has_c2pa else None
    c2pa_is_ai = "trainedAlgorithmicMedia" in info.get("source_type", "") or any(
        m in head for m in (b"trainedAlgorithmicMedia", b"compositeWithTrainedAlgorithmicMedia")
    )
-    # Generator: structured for PNG, binary-scanned for other containers.
-    generator = info.get("claim_generator") or (", ".join(tools) if (tools := _ai_tools_in(head)) else None)
+    # Generator: structured for PNG, CBOR-scanned for other containers. The claim
+    # generator is the authoritative "what produced this", so it drives platform
+    # attribution; the issuer byte-scan is only the fallback (it matches
+    # incidental chain/namespace mentions -- on real samples Leica mis-read as
+    # Truepic, Nikon as Adobe, Truepic as Google until claim-generator took over).
+    generator = (
+        info.get("claim_generator")
+        or cbor_text_after(head, b"claim_generator")
+        or (", ".join(tools) if (tools := _ai_tools_in(head)) else None)
+    )
+    platform = (_platform_from_generator(generator) or _attribute_platform(issuers)) if has_c2pa else None
    if has_c2pa:
        detail = ", ".join(filter(None, [", ".join(issuers), generator, info.get("source_type")]))
        signals.append(Signal("c2pa", detail or "C2PA manifest present", "high"))
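Review note: the attribution order introduced in the hunk above can be illustrated in isolation. The following is a minimal sketch, not the project's code: `attribute`, `platform_from_issuers`, and the `ISSUER_PLATFORM` entries are hypothetical stand-ins (the real `_attribute_platform` and `_ISSUER_PLATFORM` are not shown in this diff), and the Nikon sample string is made up for illustration. Only the three claim-generator tokens and their labels are taken from the commit.

```python
from __future__ import annotations

# Tokens and labels copied from the commit's _CLAIM_GENERATOR_PLATFORM.
CLAIM_GENERATOR_PLATFORM: tuple[tuple[str, str], ...] = (
    ("lc_c2pa", "Leica (camera, C2PA capture)"),
    ("nikon", "Nikon (camera, C2PA capture)"),
    ("truepic", "Truepic Lens (verified capture)"),
)

# ASSUMPTION: a simplified issuer table; the real mapping is not in this diff.
ISSUER_PLATFORM: dict[str, str] = {
    "Truepic": "Truepic",
    "Adobe": "Adobe Firefly",
    "Google": "Google",
}


def platform_from_generator(generator: str | None) -> str | None:
    """Case-insensitive substring match; the first matching token wins."""
    if not generator:
        return None
    low = generator.lower()
    for token, platform in CLAIM_GENERATOR_PLATFORM:
        if token in low:
            return platform
    return None


def platform_from_issuers(issuers: list[str]) -> str | None:
    """Hypothetical stand-in for the issuer-scan fallback."""
    for issuer in issuers:
        if issuer in ISSUER_PLATFORM:
            return ISSUER_PLATFORM[issuer]
    return None


def attribute(generator: str | None, issuers: list[str], has_c2pa: bool) -> str | None:
    """Claim generator first, issuer scan only as the fallback."""
    if not has_c2pa:
        return None
    return platform_from_generator(generator) or platform_from_issuers(issuers)


# A Leica-like generator wins even though "Truepic" was found by the issuer scan.
print(attribute("M11-P/2.0.1 lc_c2pa", ["Truepic"], True))
# Made-up Nikon-like string that also contains an Adobe token.
print(attribute("NIKON sample Adobe_MAX", ["Adobe"], True))
# No generator recovered: the issuer scan decides.
print(attribute(None, ["Google"], True))
```

Because the `or` short-circuits, the issuer scan is consulted only when the claim generator is missing or matches no known token, which is the behavior the regression test pins down.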
@@ -131,7 +131,7 @@ def extract_c2pa_info(image_path: Path) -> dict[str, Any]:
    return c2pa_info


-def _cbor_text_after(payload: bytes, key: bytes) -> str | None:
+def cbor_text_after(payload: bytes, key: bytes) -> str | None:
    """Return the CBOR text-string immediately following ``key`` in ``payload``.

    Handles CBOR major-type 3 length prefixes: direct (0x60-0x77), 1-byte
@@ -212,9 +212,9 @@ def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
    # Guard with isprintable(): on some manifests (e.g. Microsoft Designer) the
    # first ``name`` key precedes a binary field (a hash), not the generator
    # string, which would otherwise surface as control-char garbage.
-    if (generator := _cbor_text_after(chunk_data, b"name")) and generator.isprintable():
+    if (generator := cbor_text_after(chunk_data, b"name")) and generator.isprintable():
        c2pa_info["claim_generator"] = generator
-    if (spec := _cbor_text_after(chunk_data, b"specVersion")) and spec.isprintable():
+    if (spec := cbor_text_after(chunk_data, b"specVersion")) and spec.isprintable():
        c2pa_info["c2pa_spec"] = spec

    # Find actions
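Review note: since `cbor_text_after` is now part of the public surface, a sketch of the decoding it describes may help callers reason about it. The function body is not visible in this diff, so the code below is an assumption reconstructed from the docstring fragment and the existing tests: `text_after` is a hypothetical name, it looks only at the first occurrence of the key, and it returns None on a truncated string, neither of which is confirmed by the source.

```python
from __future__ import annotations

import struct


def text_after(payload: bytes, key: bytes) -> str | None:
    """Decode the CBOR text string (major type 3) right after ``key``."""
    idx = payload.find(key)
    if idx < 0:
        return None  # key not present
    pos = idx + len(key)
    if pos >= len(payload):
        return None  # key is the last thing in the buffer
    head = payload[pos]
    if 0x60 <= head <= 0x77:
        # Direct length: the low 5 bits of the head byte (0..23).
        length, start = head - 0x60, pos + 1
    elif head == 0x78:
        # One length byte follows the head.
        if pos + 2 > len(payload):
            return None
        length, start = payload[pos + 1], pos + 2
    elif head == 0x79:
        # Two big-endian length bytes follow the head.
        if pos + 3 > len(payload):
            return None
        length = struct.unpack(">H", payload[pos + 1 : pos + 3])[0]
        start = pos + 3
    else:
        return None  # not a text-string head this sketch handles
    raw = payload[start : start + length]
    if len(raw) < length:
        return None  # ASSUMPTION: truncated strings are rejected
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Mirrors the latin-1 fallback exercised by the existing tests.
        return raw.decode("latin-1")


gen = b"M11-P/2.0.1 lc_c2pa"
blob = b"claim_generator" + bytes([0x60 + len(gen)]) + gen
print(text_after(blob, b"claim_generator"))
```

Callers in `identify.py` pass the first MB of a non-PNG container, so a byte-level `find` like this can land on an unrelated occurrence of the key; the `isprintable()` guard used in `_parse_c2pa_chunk` is one way to filter such hits.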
@@ -337,6 +337,25 @@ class TestIdentifyIptcAi:
        assert "Gemini" in r.platform


+class TestIdentifyC2paClaimGenerator:
+    """C2PA attribution prefers claim_generator over incidental issuer tokens."""
+
+    def test_claim_generator_beats_incidental_tokens(self, tmp_path: Path):
+        # Real manifests mention timestamp authorities / XMP toolkits, so
+        # "Adobe"/"Google"/"Truepic" appear incidentally; the claim generator
+        # (a Leica camera) must win the platform attribution. Regression guard
+        # for the real-sample mis-attribution (Leica->Truepic, Nikon->Adobe).
+        gen = b"M11-P/2.0.1 lc_c2pa"
+        blob = (
+            b"\xff\xd8\xff\xe1 c2pa.claim jumbf Adobe Google Truepic "
+            b"claim_generator" + bytes([0x60 + len(gen)]) + gen + b" \xff\xd9"
+        )
+        p = tmp_path / "leica_like.jpg"
+        p.write_bytes(blob)
+        r = identify(p, check_visible=False, check_invisible=False)
+        assert r.platform == "Leica (camera, C2PA capture)"
+
+
 # ── Open invisible watermark (SD/SDXL/FLUX) integration ─────────────

 from remove_ai_watermarks.invisible_watermark import is_available as _wm_available  # noqa: E402
@@ -8,8 +8,8 @@ from pathlib import Path
 import pytest

 from remove_ai_watermarks.noai.c2pa import (
-    _cbor_text_after,
    _parse_c2pa_chunk,
+    cbor_text_after,
    extract_c2pa_chunk,
    extract_c2pa_info,
    has_c2pa_metadata,
@@ -193,36 +193,36 @@ class TestC2PAInjectValidation:


 class TestCborTextAfter:
-    """_cbor_text_after handles the three CBOR text-string length prefixes."""
+    """cbor_text_after handles the three CBOR text-string length prefixes."""

    def test_direct_length(self):
        # major-type 3, direct length (0x60 + len). "abc" -> 0x63.
        payload = b"name" + bytes([0x63]) + b"abc"
-        assert _cbor_text_after(payload, b"name") == "abc"
+        assert cbor_text_after(payload, b"name") == "abc"

    def test_one_byte_length(self):
        s = b"x" * 30
        payload = b"name" + bytes([0x78, 30]) + s
-        assert _cbor_text_after(payload, b"name") == "x" * 30
+        assert cbor_text_after(payload, b"name") == "x" * 30

    def test_two_byte_length(self):
        s = b"y" * 300
        payload = b"name" + bytes([0x79]) + struct.pack(">H", 300) + s
-        assert _cbor_text_after(payload, b"name") == "y" * 300
+        assert cbor_text_after(payload, b"name") == "y" * 300

    def test_key_not_found_returns_none(self):
-        assert _cbor_text_after(b"nothing here", b"name") is None
+        assert cbor_text_after(b"nothing here", b"name") is None

    def test_key_at_end_returns_none(self):
-        assert _cbor_text_after(b"prefixname", b"name") is None
+        assert cbor_text_after(b"prefixname", b"name") is None

    def test_invalid_head_returns_none(self):
        # 0x00 is not a text-string head.
-        assert _cbor_text_after(b"name" + bytes([0x00]) + b"abc", b"name") is None
+        assert cbor_text_after(b"name" + bytes([0x00]) + b"abc", b"name") is None

    def test_latin1_fallback_on_invalid_utf8(self):
        payload = b"name" + bytes([0x61]) + b"\xff"  # len 1, invalid utf-8
-        assert _cbor_text_after(payload, b"name") is not None
+        assert cbor_text_after(payload, b"name") is not None


 class TestSynthIDVerdict:
@@ -2784,7 +2784,7 @@ wheels = [

 [[package]]
 name = "remove-ai-watermarks"
-version = "0.6.0"
+version = "0.6.1"
 source = { editable = "." }
 dependencies = [
    { name = "click" },