fix(identify): attribute C2PA by claim_generator, not incidental issuer tokens (v0.6.1)

Verified on real signed files that the issuer byte-scan mis-attributes
multi-entity manifests: Leica read as "Truepic" (timestamp authority in the
chain), Nikon as "Adobe Firefly" (XMP-toolkit "Adobe" + the sample's
"Adobe_MAX" name), Truepic as "Google". Platform attribution now prefers the
claim generator (what produced the asset) and falls back to the issuer scan.

- New _CLAIM_GENERATOR_PLATFORM map + _platform_from_generator; claim generator
  read for non-PNG via the now-public c2pa.cbor_text_after.
- Device tokens listed only where verified against a real C2PA file (Leica
  lc_c2pa, Nikon, Truepic Lens); Pixel/Samsung/Sony/Canon/Bria deferred until a
  real sample confirms the in-manifest string. Camera C2PA marks capture
  authenticity, so these never set is_ai.
- cbor_text_after made public (was _cbor_text_after); call sites + tests updated.
- Regression test: claim_generator beats incidental Adobe/Google/Truepic tokens.
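
A minimal sketch of the claim-generator-first order described above, assuming a simplified reader for the CBOR text string that follows the `claim_generator` key. The names `text_after`, `attribute`, `GENERATOR_PLATFORMS` and `ISSUER_PLATFORMS` are hypothetical stand-ins, not the project's API, and the issuer table in particular is invented for illustration:

```python
from __future__ import annotations

import struct

# Hypothetical tables for illustration only; the real maps live in identify.py.
GENERATOR_PLATFORMS = (
    ("lc_c2pa", "Leica (camera, C2PA capture)"),
    ("nikon", "Nikon (camera, C2PA capture)"),
    ("truepic", "Truepic Lens (verified capture)"),
)
ISSUER_PLATFORMS = (("Truepic", "Truepic"), ("Adobe", "Adobe Firefly"), ("Google", "Google"))


def text_after(payload: bytes, key: bytes) -> str | None:
    """Decode the CBOR text string (major type 3) that directly follows key."""
    pos = payload.find(key)
    if pos < 0:
        return None
    pos += len(key)
    if pos >= len(payload):
        return None  # key sits at the very end, nothing to decode
    head = payload[pos]
    if 0x60 <= head <= 0x77:  # length is stored in the head byte itself
        length, start = head - 0x60, pos + 1
    elif head == 0x78 and pos + 2 <= len(payload):  # 1-byte length follows
        length, start = payload[pos + 1], pos + 2
    elif head == 0x79 and pos + 3 <= len(payload):  # 2-byte big-endian length follows
        length, start = struct.unpack(">H", payload[pos + 1 : pos + 3])[0], pos + 3
    else:
        return None  # not a text-string head this sketch understands
    raw = payload[start : start + length]
    if len(raw) < length:
        return None  # truncated payload (a choice made for this sketch)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def attribute(head: bytes) -> str | None:
    """Prefer the claim generator; fall back to an issuer byte-scan."""
    generator = text_after(head, b"claim_generator")
    if generator:
        low = generator.lower()
        for token, platform in GENERATOR_PLATFORMS:
            if token in low:
                return platform
    # Fallback: matches any mention of the issuer name, including incidental ones.
    for token, platform in ISSUER_PLATFORMS:
        if token.encode() in head:
            return platform
    return None


gen = b"M11-P/2.0.1 lc_c2pa"
blob = b"c2pa Adobe Google Truepic claim_generator" + bytes([0x60 + len(gen)]) + gen
print(attribute(blob))             # Leica (camera, C2PA capture)
print(attribute(b"c2pa Truepic"))  # Truepic
```

The first call shows the generator token winning even though "Adobe", "Google" and "Truepic" all appear in the bytes; the second shows the issuer scan acting only when no claim generator can be read.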

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test-user
2026-05-26 20:10:07 -07:00
parent 2676325184
commit dda2ee7fbb
8 changed files with 76 additions and 20 deletions
+1 -1
@@ -28,7 +28,7 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r
- `noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor, and `soft_binding`/`soft_binding_vendors` when a `c2pa.soft-binding` `alg` names a forensic-watermark vendor (`soft_binding_vendors_in(buffer)` is the shared byte-scan, used by both the PNG parser and the non-PNG binary path).
- `noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, C2PA_ISSUERS, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new issuer/binding here, not inline.
- `metadata.py` — `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout. Both `get_ai_metadata` and `has_ai_metadata` guard the PIL open with `except Exception` (HEIC/unknown formats raise non-OSError) and fall through to the binary scan. `xai_signature(path)` detects xAI/Grok's EXIF-only scheme (`ImageDescription` = `Signature: <base64>` + UUID `Artist`); it feeds `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify`. `iptc_ai_system(path)` detects the IPTC Photo Metadata 2025.1 AI-disclosure XMP properties (`IPTC_AI_FIELD_MARKERS` = `AISystemUsed`/`AISystemVersionUsed`/`AIPromptInformation`/`AIPromptWriterName`) and returns the `AISystemUsed` generator name (or `"fields present"`). `remove_ai_metadata` routes **ISOBMFF video** (`.mp4`/`.mov`/`.m4v`) through the same `isobmff.strip_c2pa_boxes` as AVIF/HEIF (MP4 is ISOBMFF), and `_scrub_ai_exif` removes the xAI signature + AI-generator EXIF tags on JPEG output.
- `identify.py` — `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, visible Gemini sparkle, open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). Visible-sparkle is promoted only at confidence ≥ `_SPARKLE_THRESHOLD` (0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49). The cv2 dependency lives in `gemini_engine.detect_sparkle_confidence`, not here. Add platform mappings to `_ISSUER_PLATFORM`, not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged. **Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read.
- `identify.py` — `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, visible Gemini sparkle, open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). Visible-sparkle is promoted only at confidence ≥ `_SPARKLE_THRESHOLD` (0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49). The cv2 dependency lives in `gemini_engine.detect_sparkle_confidence`, not here. **C2PA platform attribution is claim-generator-first, issuer-scan fallback** (`_CLAIM_GENERATOR_PLATFORM` then `_ISSUER_PLATFORM`): the claim generator (via the now-public `c2pa.cbor_text_after(head, b"claim_generator")` for non-PNG) names what *produced* the asset, while the issuer byte-scan matches incidental chain mentions. **Verified on real signed files 2026-05-26:** the old issuer-only scan mis-attributed Leica→"Truepic" (timestamp authority), Nikon→"Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), and Truepic→"Google"; claim-generator-first fixes all three. `_CLAIM_GENERATOR_PLATFORM` only lists device tokens **verified against a real C2PA file** (Leica `lc_c2pa`, Nikon, Truepic Lens) -- Pixel/Samsung/Sony/Canon/Bria are documented but NOT added until a real signed sample confirms the in-manifest string (same fixture discipline as Grok/Doubao/Ideogram). Camera C2PA marks capture authenticity, not AI, so these never set `is_ai` (that stays driven by digital-source-type). Add platform mappings to `_CLAIM_GENERATOR_PLATFORM` / `_ISSUER_PLATFORM`, not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged. **Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read.
- `gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU). `detect_sparkle_confidence(path)` is the file-level entry point used by `identify.py`.
- `invisible_watermark.py` — `detect_invisible_watermark(path)` decodes the OPEN DWT-DCT watermarks (public decoder, no key) embedded by Stable Diffusion / SDXL / FLUX via the `imwatermark` library. Known fixed patterns (verified against upstream source) live in `_BITS_48` (SDXL 48-bit, FLUX.2 48-bit) and `_SD1_STRING` ("StableDiffusionV1", SD 1.x/2.x). Optional dep (extra `detect`); returns None when absent. The `detect` extra pulls **torch** transitively (invisible-watermark declares torch a hard dep, and `WatermarkDecoder` eagerly imports `rivaGan` -> `torch` at import time), so detection needs torch present even though dwtDct runs CPU-only on cv2/numpy/pywavelets — no GPU and no separate `gpu` extra required. **Unlike SynthID this is locally detectable**, but the watermark is fragile (does not survive JPEG re-encode/resize — verified gone after JPEG q90), so it confirms origin only on pristine files. Add new known patterns here. The file carries a top-of-module pyright pragma because imwatermark/cv2 ship no type stubs.
- `trustmark_detector.py` — `detect_trustmark(path)` decodes the OPEN, keyless **Adobe TrustMark** watermark (the soft binding behind Adobe Durable Content Credentials, `alg` `com.adobe.trustmark.P`) via the optional `trustmark` package (extra `trustmark`; pulls torch, downloads model weights on first use). Mirrors `invisible_watermark.py` (lazy singleton, top-of-module pyright pragma, returns None when absent). It detects *provenance*, not AI origin as such (TrustMark also marks human-authored content), so `identify` lists it as a watermark without setting `is_ai_generated`. Other soft-binding vendors (Digimarc/Imatag/Steg.AI/...) have no public decoder — they are only *named* via the `C2PA_SOFT_BINDINGS` scan, not decoded.
+1 -1
@@ -1,6 +1,6 @@
[project]
name = "remove-ai-watermarks"
version = "0.6.0"
version = "0.6.1"
description = "Remove visible and invisible AI watermarks from images (Gemini / Nano Banana, ChatGPT, Stable Diffusion)"
readme = "README.md"
requires-python = ">=3.10"
+1 -1
@@ -1,3 +1,3 @@
"""Remove-AI-Watermarks: Unified tool for removing visible and invisible AI watermarks."""
__version__ = "0.6.0"
__version__ = "0.6.1"
+41 -4
@@ -34,7 +34,7 @@ from remove_ai_watermarks.metadata import (
iptc_ai_system,
xai_signature,
)
from remove_ai_watermarks.noai.c2pa import extract_c2pa_info, soft_binding_vendors_in
from remove_ai_watermarks.noai.c2pa import cbor_text_after, extract_c2pa_info, soft_binding_vendors_in
from remove_ai_watermarks.noai.constants import C2PA_AI_TOOLS, C2PA_ISSUERS
if TYPE_CHECKING:
@@ -127,6 +127,35 @@ def _ai_tools_in(data: bytes) -> list[str]:
return sorted({name for sig, name in C2PA_AI_TOOLS.items() if sig in data})
# C2PA claim-generator substring -> platform. The claim generator names what
# PRODUCED the asset, so it is far more reliable than byte-scanning the manifest
# for an issuer name (which also matches incidental mentions: a timestamp
# authority like "Truepic" in a Leica chain, an XMP-toolkit "Adobe" string in a
# Nikon file, or "Google" in a URL -- all verified on real samples). Ordered:
# camera tokens first so a device wins over an incidental tool name (Nikon's
# real sample claim generator also contains "Adobe_MAX"). Camera C2PA marks
# CAPTURE authenticity, not AI, so these never assert is_ai on their own (the
# verdict still comes from the digital-source-type). Only tokens verified
# against a real signed file are listed; add more as samples are captured.
_CLAIM_GENERATOR_PLATFORM: tuple[tuple[str, str], ...] = (
    ("lc_c2pa", "Leica (camera, C2PA capture)"),
    ("leica", "Leica (camera, C2PA capture)"),
    ("nikon", "Nikon (camera, C2PA capture)"),
    ("truepic", "Truepic Lens (verified capture)"),
)


def _platform_from_generator(generator: str | None) -> str | None:
    """Map a C2PA claim-generator string to a device/platform, or None."""
    if not generator:
        return None
    low = generator.lower()
    for token, platform in _CLAIM_GENERATOR_PLATFORM:
        if token in low:
            return platform
    return None
def _attribute_platform(issuers: list[str]) -> str | None:
"""Map a set of C2PA issuer names to a human-readable generating platform."""
joined = " ".join(issuers)
@@ -205,12 +234,20 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
# ── C2PA Content Credentials ────────────────────────────────────
has_c2pa = bool(info) or b"c2pa" in head.lower() or C2PA_UUID in head
issuers = [info["issuer"]] if info.get("issuer") else _issuers_in(head)
platform = _attribute_platform(issuers) if has_c2pa else None
c2pa_is_ai = "trainedAlgorithmicMedia" in info.get("source_type", "") or any(
m in head for m in (b"trainedAlgorithmicMedia", b"compositeWithTrainedAlgorithmicMedia")
)
# Generator: structured for PNG, binary-scanned for other containers.
generator = info.get("claim_generator") or (", ".join(tools) if (tools := _ai_tools_in(head)) else None)
# Generator: structured for PNG, CBOR-scanned for other containers. The claim
# generator is the authoritative "what produced this", so it drives platform
# attribution; the issuer byte-scan is only the fallback (it matches
# incidental chain/namespace mentions -- on real samples Leica mis-read as
# Truepic, Nikon as Adobe, Truepic as Google until claim-generator took over).
generator = (
    info.get("claim_generator")
    or cbor_text_after(head, b"claim_generator")
    or (", ".join(tools) if (tools := _ai_tools_in(head)) else None)
)
platform = (_platform_from_generator(generator) or _attribute_platform(issuers)) if has_c2pa else None
if has_c2pa:
    detail = ", ".join(filter(None, [", ".join(issuers), generator, info.get("source_type")]))
    signals.append(Signal("c2pa", detail or "C2PA manifest present", "high"))
+3 -3
@@ -131,7 +131,7 @@ def extract_c2pa_info(image_path: Path) -> dict[str, Any]:
return c2pa_info
def _cbor_text_after(payload: bytes, key: bytes) -> str | None:
def cbor_text_after(payload: bytes, key: bytes) -> str | None:
"""Return the CBOR text-string immediately following ``key`` in ``payload``.
Handles CBOR major-type 3 length prefixes: direct (0x60-0x77), 1-byte
@@ -212,9 +212,9 @@ def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
# Guard with isprintable(): on some manifests (e.g. Microsoft Designer) the
# first ``name`` key precedes a binary field (a hash), not the generator
# string, which would otherwise surface as control-char garbage.
if (generator := _cbor_text_after(chunk_data, b"name")) and generator.isprintable():
if (generator := cbor_text_after(chunk_data, b"name")) and generator.isprintable():
c2pa_info["claim_generator"] = generator
if (spec := _cbor_text_after(chunk_data, b"specVersion")) and spec.isprintable():
if (spec := cbor_text_after(chunk_data, b"specVersion")) and spec.isprintable():
c2pa_info["c2pa_spec"] = spec
# Find actions
+19
@@ -337,6 +337,25 @@ class TestIdentifyIptcAi:
assert "Gemini" in r.platform
class TestIdentifyC2paClaimGenerator:
    """C2PA attribution prefers claim_generator over incidental issuer tokens."""

    def test_claim_generator_beats_incidental_tokens(self, tmp_path: Path):
        # Real manifests mention timestamp authorities / XMP toolkits, so
        # "Adobe"/"Google"/"Truepic" appear incidentally; the claim generator
        # (a Leica camera) must win the platform attribution. Regression guard
        # for the real-sample mis-attribution (Leica->Truepic, Nikon->Adobe).
        gen = b"M11-P/2.0.1 lc_c2pa"
        blob = (
            b"\xff\xd8\xff\xe1 c2pa.claim jumbf Adobe Google Truepic "
            b"claim_generator" + bytes([0x60 + len(gen)]) + gen + b" \xff\xd9"
        )
        p = tmp_path / "leica_like.jpg"
        p.write_bytes(blob)
        r = identify(p, check_visible=False, check_invisible=False)
        assert r.platform == "Leica (camera, C2PA capture)"
# ── Open invisible watermark (SD/SDXL/FLUX) integration ─────────────
from remove_ai_watermarks.invisible_watermark import is_available as _wm_available # noqa: E402
+9 -9
@@ -8,8 +8,8 @@ from pathlib import Path
import pytest
from remove_ai_watermarks.noai.c2pa import (
_cbor_text_after,
_parse_c2pa_chunk,
cbor_text_after,
extract_c2pa_chunk,
extract_c2pa_info,
has_c2pa_metadata,
@@ -193,36 +193,36 @@ class TestC2PAInjectValidation:
class TestCborTextAfter:
"""_cbor_text_after handles the three CBOR text-string length prefixes."""
"""cbor_text_after handles the three CBOR text-string length prefixes."""
def test_direct_length(self):
# major-type 3, direct length (0x60 + len). "abc" -> 0x63.
payload = b"name" + bytes([0x63]) + b"abc"
assert _cbor_text_after(payload, b"name") == "abc"
assert cbor_text_after(payload, b"name") == "abc"
def test_one_byte_length(self):
s = b"x" * 30
payload = b"name" + bytes([0x78, 30]) + s
assert _cbor_text_after(payload, b"name") == "x" * 30
assert cbor_text_after(payload, b"name") == "x" * 30
def test_two_byte_length(self):
s = b"y" * 300
payload = b"name" + bytes([0x79]) + struct.pack(">H", 300) + s
assert _cbor_text_after(payload, b"name") == "y" * 300
assert cbor_text_after(payload, b"name") == "y" * 300
def test_key_not_found_returns_none(self):
assert _cbor_text_after(b"nothing here", b"name") is None
assert cbor_text_after(b"nothing here", b"name") is None
def test_key_at_end_returns_none(self):
assert _cbor_text_after(b"prefixname", b"name") is None
assert cbor_text_after(b"prefixname", b"name") is None
def test_invalid_head_returns_none(self):
# 0x00 is not a text-string head.
assert _cbor_text_after(b"name" + bytes([0x00]) + b"abc", b"name") is None
assert cbor_text_after(b"name" + bytes([0x00]) + b"abc", b"name") is None
def test_latin1_fallback_on_invalid_utf8(self):
payload = b"name" + bytes([0x61]) + b"\xff" # len 1, invalid utf-8
assert _cbor_text_after(payload, b"name") is not None
assert cbor_text_after(payload, b"name") is not None
class TestSynthIDVerdict:
+1 -1
@@ -2784,7 +2784,7 @@ wheels = [
[[package]]
name = "remove-ai-watermarks"
version = "0.6.0"
version = "0.6.1"
source = { editable = "." }
dependencies = [
{ name = "click" },