mirror of
https://github.com/wiltodelta/remove-ai-watermarks.git
synced 2026-05-26 14:17:47 +02:00
feat(identify): provenance command (platform + watermark inventory)
New 'identify' command and identify.py module: upload an image, get one ProvenanceReport answering where it was made and what watermarks it carries.

Aggregates every locally-readable signal:
- C2PA Content Credentials -> generating platform (issuer + generator).
- IPTC digitalSourceType 'Made with AI' (Meta and others).
- Embedded SD/ComfyUI generation parameters (local pipelines).
- SynthID metadata proxy (Google / OpenAI C2PA companion).
- Visible Gemini sparkle (cv2 fallback for the stripped-metadata case), promoted only at confidence >= 0.5 (corpus-tuned: Gemini sparkles score >= 0.56, non-sparkle <= 0.49).

is_ai_generated is True or None, never asserted False -- stripped metadata leaves no local proof of a clean origin, so absence of signals is reported as 'unknown' with an explicit caveat. The SynthID *pixel* watermark remains locally undecodable; the report says so.

Non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) get the same issuer + generator attribution via a binary scan (the caBX parser is PNG-only). The cv2 dependency is isolated in gemini_engine.detect_sparkle_confidence so identify.py stays type-clean. CLI supports --json and --no-visible.

Validated against the 109-image corpus: 14/14 positives flagged AI, 93/94 negatives clean (the one 'neg' flagged is a Meta image that genuinely carries the IPTC tag -- correct), zero true errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
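The True-or-None rule in the commit message can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the module's code: the names `verdict` and `SPARKLE_THRESHOLD` are hypothetical, and the two arguments stand in for "some metadata signal was found" and the sparkle detector's score.

```python
from __future__ import annotations

# Hypothetical restatement of the verdict rule; only the 0.5 threshold and the
# "high" / "medium" / "none" labels are taken from the commit itself.
SPARKLE_THRESHOLD = 0.5


def verdict(metadata_signal: bool, sparkle_confidence: float | None) -> tuple[bool | None, str]:
    """Return (is_ai_generated, confidence). False is never returned."""
    if metadata_signal:
        return True, "high"
    if sparkle_confidence is not None and sparkle_confidence >= SPARKLE_THRESHOLD:
        return True, "medium"
    # No signal is not evidence of a clean origin, so the answer is "unknown".
    return None, "none"


print(verdict(True, None))    # (True, 'high')
print(verdict(False, 0.56))   # (True, 'medium')
print(verdict(False, 0.49))   # (None, 'none')
```

The design point is the missing `False` branch: a stripped or re-encoded image looks identical to one that never carried metadata, so the sketch can only move from "unknown" to "AI-generated", never the other way.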
@@ -5,6 +5,7 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r
## How to run

- `uv run remove-ai-watermarks all <image.png> -o <output.png>`
- `uv run remove-ai-watermarks identify <image>` — provenance verdict (platform + watermark inventory + confidence); `--json` for machine output, `--no-visible` to skip the cv2 sparkle detector
- `uv run remove-ai-watermarks metadata <image.png> --check` — inspect AI metadata (C2PA, EXIF, PNG chunks)
- `uv run remove-ai-watermarks metadata <image.png> --remove -o <out.png>` — strip all AI metadata
@@ -25,7 +26,9 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r
- `noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor.
- `noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, C2PA_ISSUERS, and `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI). Add a new issuer here, not inline.
- `metadata.py` — `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout.
- `metadata.py` — `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout. Both `get_ai_metadata` and `has_ai_metadata` guard the PIL open with `except Exception` (HEIC/unknown formats raise non-OSError) and fall through to the binary scan.
- `identify.py` — `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, IPTC "Made with AI", embedded SD/ComfyUI params, SynthID proxy, visible Gemini sparkle) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). Visible-sparkle is promoted only at confidence ≥ `_SPARKLE_THRESHOLD` (0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49). The cv2 dependency lives in `gemini_engine.detect_sparkle_confidence`, not here. Add platform mappings to `_ISSUER_PLATFORM`, not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB; EXIF/XMP *fields* inside ISOBMFF are still not parsed (no positive fixtures to validate against).
- `gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU). `detect_sparkle_confidence(path)` is the file-level entry point used by `identify.py`.
- `face_protector.py` — YOLO detect + soft-blend pattern; mirror this for any "protect region during diffusion" features

## Known limitations
@@ -180,6 +180,13 @@ remove-ai-watermarks batch ./images/ --mode all
#### Individual commands

```bash
# Identify provenance: where an image was made + its watermark inventory.
# Aggregates C2PA, IPTC "Made with AI", embedded SD/ComfyUI params, the
# SynthID proxy, and the visible Gemini sparkle into one verdict. Reports
# "unknown" (never "clean") when no signal is found, since stripped metadata
# is not proof of a clean origin. Add --json for machine-readable output.
remove-ai-watermarks identify image.png

# Visible watermark only (Gemini / Nano Banana sparkle) — fast, offline
remove-ai-watermarks visible image.png -o clean.png
@@ -8,6 +8,7 @@ Provides commands for:
from __future__ import annotations

import json
import logging
import time
from pathlib import Path
@@ -329,6 +330,55 @@ def cmd_metadata(
|
||||
console.print(f" [green]✓[/] AI metadata stripped → {out}")


# ── Provenance identification ───────────────────────────────────────


@main.command("identify")
@click.argument("source", type=click.Path(exists=True, path_type=Path))
@click.option("--no-visible", is_flag=True, help="Skip the visible-sparkle detector (metadata-only, no cv2).")
@click.option("--json", "as_json", is_flag=True, help="Emit the report as JSON instead of a table.")
@click.pass_context
def cmd_identify(ctx: click.Context, source: Path, no_visible: bool, as_json: bool) -> None:
    """Identify where an image was made and what watermarks it carries.

    Aggregates C2PA Content Credentials, IPTC "Made with AI" tags, embedded
    generation parameters, the SynthID metadata proxy, and the visible Gemini
    sparkle into a single provenance verdict. Absence of signals is reported as
    "unknown", never as "clean" (stripped metadata leaves no local proof).
    """
    from dataclasses import asdict

    from remove_ai_watermarks.identify import identify

    source = _validate_image(source)
    report = identify(source, check_visible=not no_visible)

    if as_json:
        click.echo(json.dumps(asdict(report), default=str, indent=2))
        return

    _banner()
    verdict = {True: "[yellow]AI-generated[/]", False: "[green]not AI[/]", None: "[dim]unknown[/]"}[
        report.is_ai_generated
    ]
    console.print(f"\n Verdict: {verdict} [dim](confidence: {report.confidence})[/]")
    console.print(f" Platform: {report.platform or '[dim]undetermined[/]'}")

    if report.watermarks:
        table = Table(show_header=True, header_style="bold", title="Watermarks / provenance markers")
        table.add_column("Marker", style="cyan")
        for wm in report.watermarks:
            table.add_row(wm)
        console.print(table)
    else:
        console.print(" [dim]No watermarks or provenance markers found.[/]")

    if report.caveats:
        console.print("\n [dim]Caveats:[/]")
        for c in report.caveats:
            console.print(f" [dim]- {c}[/]")


# ── Combined "all" mode ──────────────────────────────────────────────

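The `--json` branch of `cmd_identify` passes `default=str` to `json.dumps` because the report carries a `Path`, which the `json` module cannot encode by itself. A minimal sketch of that behaviour follows; `Report` is a hypothetical, trimmed stand-in for `ProvenanceReport`, not the real class.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Report:  # hypothetical stand-in with a subset of the real fields
    path: Path
    is_ai_generated: bool | None
    watermarks: list[str] = field(default_factory=list)


report = Report(path=Path("image.png"), is_ai_generated=None)

# Without default=str the Path field makes json.dumps raise TypeError.
try:
    json.dumps(asdict(report))
    plain_dump_failed = False
except TypeError:
    plain_dump_failed = True

# With default=str the Path is written as its string form.
payload = json.dumps(asdict(report), default=str, indent=2)
decoded = json.loads(payload)
print(decoded["path"])             # image.png
print(decoded["is_ai_generated"])  # None
```

An "unknown" verdict therefore reaches JSON consumers as `null`, which keeps it distinct from `true`.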
@@ -546,3 +546,17 @@ class GeminiEngine:
|
||||
result[py1:py2, px1:px2] = blended.astype(np.uint8)
|
||||
|
||||
return result


def detect_sparkle_confidence(image_path: Path) -> float | None:
    """Visible-sparkle detection confidence for a file, for provenance use.

    Loads the image with cv2 and runs :meth:`GeminiEngine.detect_watermark`.
    Returns the NCC confidence in [0, 1], or None if the image cannot be read
    (cv2 returns None for unsupported containers such as HEIC). Kept here so the
    cv2 dependency stays in this module; callers apply their own threshold.
    """
    img = cv2.imread(str(image_path))
    if img is None:
        return None
    return float(GeminiEngine().detect_watermark(img).confidence)
@@ -0,0 +1,243 @@
"""Image provenance: identify where an image was made and what watermarks it carries.

Aggregates every locally-readable signal into a single :class:`ProvenanceReport`:

- **C2PA Content Credentials** (issuer, claim generator, digital source type) ->
  the signing platform (OpenAI, Google, Adobe, Microsoft).
- **IPTC ``digitalSourceType``** "Made with AI" marker (Meta, X, others).
- **PNG text / EXIF generation parameters** (Stable Diffusion, ComfyUI, InvokeAI).
- **SynthID metadata proxy** -- a C2PA companion from a SynthID-using vendor
  (Google / OpenAI) implies the invisible pixel watermark.
- **Visible Gemini sparkle** (optional; needs cv2/numpy, no GPU).

Hard limit: a stripped image (re-encoded, screenshotted, social-media upload)
loses all metadata, and the SynthID *pixel* watermark is not locally decodable
(proprietary decoder). Absence of signals is therefore reported as ``Unknown``,
never as "clean". See CLAUDE.md "SynthID detection is metadata-only".
"""

from __future__ import annotations

import logging
from dataclasses import dataclass, field
from typing import TYPE_CHECKING

from remove_ai_watermarks.metadata import (
    AI_METADATA_KEYS,
    C2PA_UUID,
    IPTC_AI_MARKERS,
    get_ai_metadata,
)
from remove_ai_watermarks.noai.c2pa import extract_c2pa_info
from remove_ai_watermarks.noai.constants import C2PA_AI_TOOLS, C2PA_ISSUERS

if TYPE_CHECKING:
    from pathlib import Path

log = logging.getLogger(__name__)

# How much of a non-PNG container to binary-scan for the C2PA issuer.
_SCAN_BYTES = 1024 * 1024

# Visible-sparkle confidence above which the signal is trusted as provenance.
# Stricter than the removal default (0.25): on the corpus, Gemini-family
# sparkles score >= 0.56 while non-sparkle images top out at 0.49, so 0.5
# cleanly separates them and avoids false positives when sparkle is the only
# signal (e.g. an OpenAI image scored 0.37 -- below threshold, correctly dropped).
_SPARKLE_THRESHOLD = 0.5

# Issuer (C2PA signer) -> human-readable generating platform. Ordered: when a
# manifest names several issuers (Microsoft Designer signs as "OpenAI,
# Microsoft"), the first match wins so the product, not the backend, is named.
_ISSUER_PLATFORM: tuple[tuple[str, str], ...] = (
    ("Microsoft", "Microsoft Designer (DALL-E / OpenAI backend)"),
    ("Adobe", "Adobe Firefly"),
    ("OpenAI", "OpenAI (ChatGPT / gpt-image / DALL-E / Sora)"),
    ("Google", "Google (Gemini / Imagen)"),
)

# PNG-text / EXIF keys that indicate a local diffusion pipeline (vs. a hosted
# platform's C2PA). Subset of AI_METADATA_KEYS; excludes the C2PA/Software keys.
_LOCAL_GEN_KEYS = frozenset(
    AI_METADATA_KEYS & {"parameters", "prompt", "negative_prompt", "workflow", "comfyui", "invokeai_metadata", "dream"}
)

_STRIP_CAVEAT = (
    "Absence of metadata is not proof the image is clean: C2PA, EXIF, and PNG "
    "text chunks are stripped by re-encoding, screenshots, or social-media upload."
)
_SYNTHID_CAVEAT = (
    "SynthID is a metadata proxy here; the pixel watermark is not locally "
    "verifiable (proprietary decoder). Confirm via the Gemini app or openai.com/verify."
)
_OPENAI_CAVEAT = (
    "OpenAI began pairing SynthID with C2PA around 2026-05; OpenAI images from "
    "before the rollout carry C2PA without SynthID, so the SynthID verdict is 'likely'."
)
_IPTC_ONLY_CAVEAT = "The IPTC 'Made with AI' tag flags AI provenance but does not identify the specific platform."


@dataclass
class Signal:
    """A single provenance signal that was found (or affirmatively absent)."""

    name: str
    detail: str
    confidence: str  # "high" | "medium"


@dataclass
class ProvenanceReport:
    """Aggregated provenance verdict for one image."""

    path: Path
    is_ai_generated: bool | None  # True or None; False is never asserted (None = unknown)
    platform: str | None
    confidence: str  # "high" | "medium" | "none"
    watermarks: list[str] = field(default_factory=list[str])
    signals: list[Signal] = field(default_factory=list["Signal"])
    caveats: list[str] = field(default_factory=list[str])


def _issuers_in(data: bytes) -> list[str]:
    """C2PA issuer names whose signature bytes appear in ``data`` (binary scan)."""
    return sorted({name for sig, name in C2PA_ISSUERS.items() if sig in data})


def _ai_tools_in(data: bytes) -> list[str]:
    """Known C2PA AI-tool / generator names appearing in ``data`` (binary scan).

    PNG has a structured claim_generator; for JPEG/WebP/AVIF/HEIF/JXL the
    generator lives in a JUMBF/EXIF/XMP blob the PNG parser can't reach, so a
    byte scan recovers the same attribution (e.g. "Imagen", "DALL-E").
    """
    return sorted({name for sig, name in C2PA_AI_TOOLS.items() if sig in data})


def _attribute_platform(issuers: list[str]) -> str | None:
    """Map a set of C2PA issuer names to a human-readable generating platform."""
    joined = " ".join(issuers)
    for needle, platform in _ISSUER_PLATFORM:
        if needle in joined:
            return platform
    if issuers:  # e.g. Truepic alone -- a signing authority, not a generator
        return f"C2PA signer: {', '.join(issuers)} (no known AI generator named)"
    return None


def _visible_sparkle(image_path: Path) -> float | None:
    """Visible Gemini-sparkle confidence in [0, 1], or None if unavailable.

    Optional: needs cv2/numpy (no GPU). The cv2 work lives in gemini_engine so
    this module stays dependency-light; returns None if cv2 or the engine
    assets are missing, or the image can't be read.
    """
    try:
        from remove_ai_watermarks.gemini_engine import detect_sparkle_confidence
    except Exception as exc:  # cv2/engine assets missing
        log.debug("visible-sparkle detector unavailable: %s", exc)
        return None
    return detect_sparkle_confidence(image_path)


def identify(image_path: Path, *, check_visible: bool = True) -> ProvenanceReport:
    """Identify an image's origin platform and watermark inventory.

    Args:
        image_path: Path to the image (PNG, JPEG, WebP, or ISOBMFF container).
        check_visible: Also run the visible Gemini-sparkle detector (cv2). Set
            False for a pure-metadata, dependency-light scan.

    Returns:
        A :class:`ProvenanceReport`. ``is_ai_generated`` is True when any AI
        signal is found and None (unknown) when none is -- it is never asserted
        False, because stripped metadata leaves no local proof of a clean origin.
    """
    info = extract_c2pa_info(image_path)  # PNG-structured; {} for other formats
    meta = get_ai_metadata(image_path)  # PNG text + EXIF + C2PA fields + synthid

    # First MB covers C2PA (PNG caBX, JPEG APP11, AVIF/HEIF/JXL uuid box) and
    # IPTC markers for the non-PNG path where extract_c2pa_info returns {}.
    with open(image_path, "rb") as f:
        head = f.read(_SCAN_BYTES)

    signals: list[Signal] = []
    watermarks: list[str] = []
    caveats: list[str] = []

    # ── C2PA Content Credentials ────────────────────────────────────
    has_c2pa = bool(info) or b"c2pa" in head.lower() or C2PA_UUID in head
    issuers = [info["issuer"]] if info.get("issuer") else _issuers_in(head)
    platform = _attribute_platform(issuers) if has_c2pa else None
    c2pa_is_ai = "trainedAlgorithmicMedia" in info.get("source_type", "") or any(
        m in head for m in (b"trainedAlgorithmicMedia", b"compositeWithTrainedAlgorithmicMedia")
    )
    # Generator: structured for PNG, binary-scanned for other containers.
    generator = info.get("claim_generator") or (", ".join(tools) if (tools := _ai_tools_in(head)) else None)
    if has_c2pa:
        detail = ", ".join(filter(None, [", ".join(issuers), generator, info.get("source_type")]))
        signals.append(Signal("c2pa", detail or "C2PA manifest present", "high"))
        watermarks.append(f"C2PA Content Credentials ({', '.join(issuers) or 'unknown signer'})")

    # ── SynthID metadata proxy ──────────────────────────────────────
    # get_ai_metadata already sets synthid_watermark for both PNG (caBX parser)
    # and non-PNG (its own synthid_source fallback), so no extra scan is needed.
    synthid = meta.get("synthid_watermark")
    if synthid:
        watermarks.append(f"SynthID pixel watermark ({synthid})")
        caveats.append(_SYNTHID_CAVEAT)
        if "OpenAI" in (" ".join(issuers) + synthid):
            caveats.append(_OPENAI_CAVEAT)

    # ── IPTC "Made with AI" (Meta etc.), only meaningful without C2PA ─
    iptc = any(m in head for m in IPTC_AI_MARKERS)
    if iptc and not has_c2pa:
        signals.append(Signal("iptc", "digitalSourceType (Made with AI)", "high"))
        watermarks.append("IPTC digitalSourceType (Made with AI)")
        caveats.append(_IPTC_ONLY_CAVEAT)
        if platform is None:
            platform = "Made-with-AI tag (e.g. Meta AI); platform not specified"

    # ── Local diffusion parameters (Stable Diffusion / ComfyUI) ──────
    local_keys = sorted(k for k in meta if k.lower() in _LOCAL_GEN_KEYS)
    if local_keys:
        signals.append(Signal("gen_params", f"embedded keys: {', '.join(local_keys)}", "high"))
        watermarks.append("Embedded generation parameters (Stable Diffusion / ComfyUI)")
        if platform is None:
            platform = "Stable Diffusion / local pipeline (Automatic1111, ComfyUI, InvokeAI)"

    # ── Verdict so far (metadata) ───────────────────────────────────
    ai_from_metadata = bool((has_c2pa and (c2pa_is_ai or synthid)) or iptc or local_keys)

    # ── Visible Gemini sparkle (fallback for stripped-metadata case) ─
    if check_visible and (conf := _visible_sparkle(image_path)) is not None and conf >= _SPARKLE_THRESHOLD:
        signals.append(Signal("visible_sparkle", f"NCC confidence {conf:.2f}", "medium"))
        watermarks.append(f"Visible Gemini sparkle (confidence {conf:.2f})")
        if platform is None:
            platform = "Google Gemini family (visible sparkle detected)"

    visible_only = any(s.name == "visible_sparkle" for s in signals) and not ai_from_metadata

    if ai_from_metadata:
        is_ai: bool | None = True
        confidence = "high"
    elif visible_only:
        is_ai = True
        confidence = "medium"
    else:
        is_ai = None
        confidence = "none"

    caveats.append(_STRIP_CAVEAT)
    # De-duplicate while preserving order.
    caveats = list(dict.fromkeys(caveats))

    return ProvenanceReport(
        path=image_path,
        is_ai_generated=is_ai,
        platform=platform,
        confidence=confidence,
        watermarks=watermarks,
        signals=signals,
        caveats=caveats,
    )
@@ -0,0 +1,136 @@
"""Tests for the provenance identifier (identify.py).

Pure attribution logic is unit-tested directly; end-to-end verdicts assert
against the real committed C2PA / IPTC fixtures in data/samples/.
"""

from __future__ import annotations

from pathlib import Path

import pytest

from remove_ai_watermarks.identify import (
    ProvenanceReport,
    _ai_tools_in,
    _attribute_platform,
    _issuers_in,
    identify,
)

SAMPLES_DIR = Path(__file__).resolve().parent.parent / "data" / "samples"


# ── Pure attribution logic (no file IO) ─────────────────────────────


class TestAttributePlatform:
    def test_openai(self):
        assert "OpenAI" in (_attribute_platform(["OpenAI"]) or "")

    def test_designer_wins_over_openai_backend(self):
        # Microsoft Designer signs as "OpenAI, Microsoft"; name the product.
        platform = _attribute_platform(["OpenAI", "Microsoft"])
        assert platform
        assert "Designer" in platform

    def test_adobe(self):
        assert _attribute_platform(["Adobe"]) == "Adobe Firefly"

    def test_google(self):
        assert "Google" in (_attribute_platform(["Google LLC"]) or "")

    def test_truepic_is_signer_not_generator(self):
        platform = _attribute_platform(["Truepic"])
        assert platform
        assert "signer" in platform.lower()

    def test_empty_is_none(self):
        assert _attribute_platform([]) is None


class TestIssuersIn:
    def test_finds_openai(self):
        assert _issuers_in(b"...OpenAI...trainedAlgorithmicMedia") == ["OpenAI"]

    def test_finds_multiple_sorted(self):
        assert _issuers_in(b"Microsoft and OpenAI") == ["Microsoft", "OpenAI"]

    def test_none_present(self):
        assert _issuers_in(b"just some bytes") == []


class TestAiToolsIn:
    def test_finds_generator(self):
        assert _ai_tools_in(b"...claim_generator Imagen 3...") == ["Imagen"]

    def test_none_present(self):
        assert _ai_tools_in(b"a regular photo, no tools") == []


class TestIdentifyNonPng:
    """Non-PNG containers (JPEG/WebP/AVIF) carry C2PA where the caBX parser can't
    reach; identify recovers issuer + generator via the binary scan. Synthetic
    byte blobs mirror tests/test_metadata.py::TestSynthIDSourceNonPng.
    """

    def _c2pa_jpeg(self, tmp_path: Path, blob: bytes) -> Path:
        path = tmp_path / "img.jpg"
        path.write_bytes(b"\xff\xd8\xff\xe1jumbc2pa" + blob + b"\xff\xd9")
        return path

    def test_google_imagen_jpeg(self, tmp_path: Path):
        path = self._c2pa_jpeg(tmp_path, b"Google Imagen ... trainedAlgorithmicMedia")
        r = identify(path, check_visible=False)
        assert r.is_ai_generated is True
        assert r.platform is not None
        assert "Google" in r.platform
        # Generator recovered from the non-PNG blob shows up in the c2pa signal.
        c2pa_signal = next(s for s in r.signals if s.name == "c2pa")
        assert "Imagen" in c2pa_signal.detail

    def test_openai_jpeg_has_synthid(self, tmp_path: Path):
        path = self._c2pa_jpeg(tmp_path, b"OpenAI DALL-E ... trainedAlgorithmicMedia")
        r = identify(path, check_visible=False)
        assert any("SynthID" in w for w in r.watermarks)


# ── End-to-end verdicts on real fixtures ────────────────────────────


@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
class TestIdentifyRealSamples:
    def test_openai_chatgpt(self):
        r = identify(SAMPLES_DIR / "chatgpt-1.png", check_visible=False)
        assert r.is_ai_generated is True
        assert r.confidence == "high"
        assert r.platform
        assert "OpenAI" in r.platform
        assert any("C2PA" in w for w in r.watermarks)
        assert any("SynthID" in w for w in r.watermarks)

    def test_adobe_firefly_has_no_synthid(self):
        r = identify(SAMPLES_DIR / "firefly-1.png", check_visible=False)
        assert r.is_ai_generated is True
        assert r.platform == "Adobe Firefly"
        assert not any("SynthID" in w for w in r.watermarks)

    def test_iptc_made_with_ai(self):
        # mj-1.png carries the IPTC digitalSourceType "Made with AI" marker.
        r = identify(SAMPLES_DIR / "mj-1.png", check_visible=False)
        assert r.is_ai_generated is True
        assert any("IPTC" in w for w in r.watermarks)

    def test_clean_photo_is_unknown_not_clean(self):
        r = identify(SAMPLES_DIR / "not-ai-1.jpeg", check_visible=False)
        assert r.is_ai_generated is None  # never asserted False
        assert r.platform is None
        assert r.confidence == "none"
        assert r.watermarks == []

    def test_strip_caveat_always_present(self):
        r = identify(SAMPLES_DIR / "not-ai-1.jpeg", check_visible=False)
        assert any("not proof" in c for c in r.caveats)

    def test_returns_report_dataclass(self):
        assert isinstance(identify(SAMPLES_DIR / "firefly-1.png", check_visible=False), ProvenanceReport)