mirror of
https://github.com/wiltodelta/remove-ai-watermarks.git
synced 2026-05-26 14:17:47 +02:00
feat(metadata): SynthID-source detection, C2PA parser consolidation, corpus + tests
Detect SynthID-bearing images via their C2PA companion: a manifest signed by a
SynthID-using vendor (Google/OpenAI) on AI-generated content implies an
invisible SynthID pixel watermark. Verified end-to-end against the vendor
oracles (openai.com/verify, Gemini "Verify with SynthID").
- metadata: synthid_source() + synthid_watermark verdict in get_ai_metadata,
surfaced as a `metadata --check` callout. Format-agnostic (PNG caBX parser +
JPEG/WebP/AVIF/HEIF/JXL binary scan).
- constants: SYNTHID_C2PA_ISSUERS {Google, OpenAI}; +opened/placed actions.
- c2pa: single CBOR-aware parser (_cbor_text_after) replaces glitchy regex
(fixes fGPT-4o claim_generator); removed duplicate _scan_png_c2pa_chunk from
metadata; shared synthid_verdict / synthid_vendors_in helpers.
- corpus: scripts/synthid_corpus.py ingest tool + data/synthid_corpus/
(manifest tracked, images gitignored) for a labeled reference set.
- tests: +38 across C2PA parser internals, extract/inject round-trip, ISOBMFF
container stripping, all IPTC AI markers, and invisible watermark strength
tiers (SynthID/StableSignature/TreeRing/StegaStamp/RingID/RivaGAN/...).
Pixel-level SynthID detection remains out of reach locally (Google's decoder is
proprietary); a from-scratch spectral pilot confirmed that carrier frequencies do
not separate watermarked from cleaned real content. See CLAUDE.md for the full
evaluation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -21,6 +21,9 @@ Thumbs.db
# Test results
data/results/

# SynthID corpus images (manifest.csv + README.md stay tracked)
data/synthid_corpus/images/

# Reference materials
_refs/

@@ -30,3 +33,4 @@ yolov8n.pt

# Claude Code local settings
.claude/settings.local.json
data/synthid_corpus/refs/
@@ -11,6 +11,10 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r
## Test and lint

- `bash maintain.sh` — uv-outdated, uv-secure, ruff check/fix, ruff format, pyright, pytest -n auto
- `maintain.sh` does not currently finish green (pre-existing, not per-change): `uv-secure` aborts on a fixable transitive `idna` vuln, and strict pyright carries debt in `remove_ai_metadata` / `cli.py` (untyped piexif/PIL/click/rich). To gate a change, run `uv run ruff check`, `uv run pyright <changed files>`, `uv run pytest` directly.
- Run `uv run` from the repo root — from another cwd it falls back to a bare env without numpy/cv2/torch.
- Metadata/C2PA tests assert against real committed fixtures in `data/samples/` (`chatgpt-*.png` = OpenAI C2PA, `firefly-1.png` = Adobe, `not-ai-*` = clean); synthetic byte blobs cover the JPEG/ISOBMFF format paths.
- SynthID reference corpus: `scripts/synthid_corpus.py` ingests labeled images into `data/synthid_corpus/` (`manifest.csv` tracked, `images/` gitignored); see its README for the collection protocol and verification oracles.

## Configuration
@@ -21,14 +25,14 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r

- `noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor.
- `noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, C2PA_ISSUERS, and `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI). Add a new issuer here, not inline.
- `metadata.py` — `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None; `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout.
- `metadata.py` — `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout.
- `face_protector.py` — YOLO detect + soft-blend pattern; mirror this for any "protect region during diffusion" features
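As a rough illustration of the format-agnostic rule described for `synthid_source` above, the sketch below checks a byte buffer for the three signals together: a C2PA marker, an AI-source marker, and a SynthID-pairing issuer. It is a minimal sketch, not the project's implementation: `ISSUER_SIGNATURES` and `sketch_synthid_source` are hypothetical names, and the byte patterns are placeholders standing in for the real `SYNTHID_C2PA_ISSUERS` table in `noai/constants.py`.

```python
from __future__ import annotations

# Hypothetical stand-in for SYNTHID_C2PA_ISSUERS; the real patterns live in
# noai/constants.py and may differ.
ISSUER_SIGNATURES: dict[str, bytes] = {
    "Google": b"Google",
    "OpenAI": b"OpenAI",
}


def sketch_synthid_source(data: bytes) -> str | None:
    """Return comma-joined vendor names only when all three signals are present."""
    lowered = data.lower()
    has_c2pa = b"c2pa" in lowered
    # The lowercase test also covers "compositeWithTrainedAlgorithmicMedia".
    ai_source = b"trainedalgorithmicmedia" in lowered
    if not (has_c2pa and ai_source):
        return None
    vendors = [name for name, sig in ISSUER_SIGNATURES.items() if sig in data]
    return ", ".join(vendors) if vendors else None
```

Under this rule a manifest from a non-listed signer returns `None` even when it marks the content as AI-generated, which matches the "C2PA only" reporting for vendors that do not pair SynthID with C2PA.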
## Known limitations

- `invisible` pipeline downscales to model-native resolution (1024 px for SDXL) before diffusion. Degrades fine text in infographics. Tracked; fix is tile-based diffusion.
- Pyright first run is slow (2-3 min) due to ML deps (torch/diffusers/transformers stubs)
- Pyright first run is slow (2-3 min) due to ML deps (torch/diffusers/transformers stubs); full-project `uv run pyright` can stall for many minutes — scope it to changed files.
- `ultralytics` monkey-patches `PIL.Image.open` and tries to autoload `pi_heif`. When `pi_heif` is missing, opening files raises `ModuleNotFoundError`, not `UnidentifiedImageError`. Code that opens user-supplied or unknown-format files should `except Exception`, not just `OSError`/`UnidentifiedImageError`.
- Metadata detection for AVIF/HEIF/JPEG-XL relies on a binary scan for `C2PA_UUID` + `IPTC_AI_MARKERS`. C2PA removal in those containers is implemented via `noai/isobmff.py` (top-level ``uuid`` / ``jumb`` box stripper, no re-encoding). EXIF/XMP boxes inside those containers are not yet scrubbed.
- **SynthID detection is metadata-only.** There is no reliable *local* detector of the SynthID *pixel* watermark — Google's decoder is proprietary, no public spec or API (only a waitlisted portal). We detect SynthID by its C2PA companion (`synthid_source` / `SYNTHID_C2PA_ISSUERS`), which is reliable while the manifest is intact but says nothing once C2PA is stripped. Google→SynthID is long-standing; OpenAI→SynthID is confirmed by OpenAI's Help Center (ChatGPT/Codex/API "include both C2PA metadata and SynthID watermarks", updated 2026-05-21) but time-gated (pre-rollout OpenAI images carry C2PA without SynthID), so the OpenAI verdict is hedged "likely". Oracles: Gemini app "Verify with SynthID" (Google), openai.com/verify (OpenAI). The spectral phase-coherence approach from `github.com/aloshdenny/reverse-SynthID` was evaluated (May 2026) and **does not work for real-content detection**: on its own shipped codebook + validation set, watermarked and cleaned images were indistinguishable (conf within noise, cleaned often higher); it only fires on pure-black 1024x1024 reference images at exact resolution (the controlled case it was calibrated on). The README's "90% / conf=0.91" reproduces only in that lab condition. Do not build a production detector on it; if revisited, it is experimental/diagnostic only and needs a per-resolution, per-model reference corpus.
- **SynthID v2 vs default pipeline:** the SDXL-based default profile (since May 2026) defeats SynthID v2. **Verified end-to-end (May 2026):** local SDXL run on a Gemini 3 Pro output, checked via the Gemini app's "Verify with SynthID" feature, returned "no SynthID watermark detected". The same configuration is used in raiw-app production (`fal-ai/fast-sdxl` at native ~1024 px, strength 0.05, steps 50). SD-1.5 dreamshaper at 768 px was previously the default and does NOT defeat v2 — verified empirically against the same feature (strength 0.04, 0.10, and elastic warp α∈{5,8} all flagged positive). That SD-1.5 path was removed; only `default` (SDXL) and `ctrlregen` profiles remain.
- **SynthID detection is metadata-only.** There is no reliable *local* detector of the SynthID *pixel* watermark — Google's decoder is proprietary, no public spec or API (only a waitlisted portal). We detect SynthID by its C2PA companion (`synthid_source` / `SYNTHID_C2PA_ISSUERS`), which is reliable while the manifest is intact but says nothing once C2PA is stripped. Google→SynthID is long-standing; OpenAI→SynthID is confirmed by OpenAI's Help Center (ChatGPT/Codex/API "include both C2PA metadata and SynthID watermarks", updated 2026-05-21) but time-gated (pre-rollout OpenAI images carry C2PA without SynthID), so the OpenAI verdict is hedged "likely". Oracles: Gemini app "Verify with SynthID" (Google), openai.com/verify (OpenAI). The spectral phase-coherence approach from `github.com/aloshdenny/reverse-SynthID` was evaluated (May 2026) and **does not work for real-content detection**: on its own shipped codebook + validation set, watermarked and cleaned images were indistinguishable (conf within noise, cleaned often higher); it only fires on pure-black 1024x1024 reference images at exact resolution (the controlled case it was calibrated on). The README's "90% / conf=0.91" reproduces only in that lab condition. Do not build a production detector on it; if revisited, it is experimental/diagnostic only and needs a per-resolution, per-model reference corpus. A from-scratch gpt-image pilot (2026-05-24) confirmed this independently: 5 independent solid-black gpt-image outputs share a near-identical fixed signature (pairwise residual correlation **0.92**, avg-template retains 97% energy), so the watermark/carrier IS strongly present and consistent on flat content — but the carrier frequencies extracted from it do NOT discriminate real content (carrier-to-random ratio: cleaned 1.86 > watermarked 1.53; a non-gpt-image image scored highest at 3.67). The signature drowns in content texture. 
Net: a perfectly consistent solid-color signature still yields no real-content pixel detector with magnitude/carrier methods.
- **SynthID v2 vs default pipeline:** the SDXL-based default profile (since May 2026) defeats SynthID v2. **Verified end-to-end (May 2026):** local SDXL run on a Gemini 3 Pro output, checked via the Gemini app's "Verify with SynthID" feature, returned "no SynthID watermark detected". Also confirmed against **OpenAI's** SynthID (2026-05-23): a fresh ChatGPT/gpt-image output read "SynthID detected" on openai.com/verify before the local SDXL run and "SynthID not detected" after (corpus regression chain: pos `4ef377bd` -> cleaned `47188e88`). The same configuration is used in raiw-app production (`fal-ai/fast-sdxl` at native ~1024 px, strength 0.05, steps 50). SD-1.5 dreamshaper at 768 px was previously the default and does NOT defeat v2 — verified empirically against the same feature (strength 0.04, 0.10, and elastic warp α∈{5,8} all flagged positive). That SD-1.5 path was removed; only `default` (SDXL) and `ctrlregen` profiles remain.
@@ -187,6 +187,10 @@ remove-ai-watermarks visible image.png -o clean.png
remove-ai-watermarks invisible image.png -o clean.png --humanize 4.0

# Check / strip AI metadata (C2PA, EXIF, "Made with AI" labels)
# --check also flags SynthID-bearing sources: a C2PA manifest signed by
# Google or OpenAI implies an invisible SynthID watermark in the pixels
# (both vendors pair the two). Adobe Firefly / Microsoft sign C2PA without
# SynthID, so they are reported as C2PA only.
remove-ai-watermarks metadata image.png --check
remove-ai-watermarks metadata image.png --remove
@@ -253,7 +257,7 @@ pip install certifi

Tracked but not yet implemented:
- **SynthID-Image v2 automated regression test**. The default SDXL profile defeats v2 per manual checks against the [Gemini app](https://support.google.com/gemini/answer/16722517)'s "Verify with SynthID" feature on a Gemini 3 Pro output (May 2026). An automated end-to-end test would need either programmatic access to the [SynthID Detector portal](https://blog.google/innovation-and-ai/products/google-synthid-ai-content-detector/) (waitlist for media professionals and researchers) or an offline surrogate detector. Open.
- **SynthID-Image v2 automated regression test**. The default SDXL profile defeats v2 per manual checks against the [Gemini app](https://support.google.com/gemini/answer/16722517)'s "Verify with SynthID" feature on a Gemini 3 Pro output (May 2026). An automated end-to-end test would need either programmatic access to the [SynthID Detector portal](https://blog.google/innovation-and-ai/products/google-synthid-ai-content-detector/) (waitlist for media professionals and researchers) or an offline surrogate detector. The spectral phase-coherence surrogate from [reverse-SynthID](https://github.com/aloshdenny/reverse-SynthID) was evaluated and does not separate watermarked from cleaned real-content images (it only fires on controlled solid-color references at exact resolution), so it is not a usable oracle. Open.
- **AVIF / HEIF / JPEG-XL detection limits**. Removal strips top-level C2PA `uuid` and JUMBF `jumb` boxes. EXIF/XMP boxes inside these containers are not yet scrubbed (PNG and JPEG are fully covered).
- **Video pipeline (`noai-video`)**: per-frame inpainting and tracking for Sora 2 dynamic logo, Veo 3.1 badge, Kling, Runway. Separate package, not folded into this repo.
@@ -0,0 +1,75 @@
# SynthID reference corpus

A locally-collected, labeled image corpus for SynthID work. Two downstream uses:

1. **Per-resolution spectral codebook** for an experimental SynthID detector
   (carrier frequencies are resolution-dependent, so labels must record the
   exact native resolution).
2. **Removal regression set** — verify that our pipeline turns a SynthID-positive
   image into a negative one.

There is no reliable local detector of the SynthID pixel watermark (Google's
decoder is proprietary). The ground-truth label therefore comes from an
external oracle, recorded per image in `verified_via` (see below).

## Layout

```
data/synthid_corpus/
  README.md            # this protocol (committed)
  manifest.csv         # labels + provenance (committed, reviewable)
  images/              # the actual files (gitignored, local-only)
    pos/               # SynthID present
    neg/               # SynthID absent
    cleaned/           # our pipeline output from a pos image
```

Images are gitignored on purpose: the corpus is large, may contain personal or
licensed content, and SynthID-positive outputs are best kept local. The
`manifest.csv` (sha256 + labels + extracted metadata) is the durable artifact.
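Because only the manifest is committed, a local checkout can drift from it (missing or modified image files). The sketch below shows one way to check local files against the recorded digests. It assumes the layout above and the `sha256`, `filename`, and `label` columns of `manifest.csv`; `find_mismatches` is a hypothetical helper, not part of `scripts/synthid_corpus.py`.

```python
from __future__ import annotations

import csv
import hashlib
from pathlib import Path


def find_mismatches(root: Path) -> list[str]:
    """Return one message per manifest row whose local file is missing or altered."""
    problems: list[str] = []
    with open(root / "manifest.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Images are stored as images/<label>/<filename>.
            image = root / "images" / row["label"] / row["filename"]
            if not image.is_file():
                problems.append(f"missing: {row['filename']}")
                continue
            digest = hashlib.sha256(image.read_bytes()).hexdigest()
            if digest != row["sha256"]:
                problems.append(f"sha256 mismatch: {row['filename']}")
    return problems
```

An empty result means every manifest row has a byte-identical local file; it says nothing about extra files under `images/` that were never ingested.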
## Verification levels (`verified_via`)

Ground-truth quality, strongest first:

- `gemini-app` — checked via the Gemini app "Verify with SynthID" feature. Gold standard for the pixel watermark (Google models).
- `openai-verify` — checked via openai.com/verify (gold standard for OpenAI ChatGPT/Codex/API images).
- `synthid-portal` — checked via Google's SynthID Detector portal.
- `c2pa-metadata` — issuer-only proxy (Google/OpenAI C2PA manifest present). Weaker: the C2PA can be stripped while the pixel watermark remains.
- `third-party` — label asserted by an external dataset, not independently verified.
- `none` — unverified.

Prefer `gemini-app` for any image that will train the codebook or gate a test.
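The ordering above can be applied mechanically when selecting rows for a codebook or a test gate. The following is a minimal sketch under two assumptions: that rows are dictionaries as produced by `csv.DictReader` over `manifest.csv`, and that only the three vendor oracles count as pixel-level ground truth. `verification_rank`, `oracle_verified`, and `ORACLE_LEVELS` are hypothetical names.

```python
from __future__ import annotations

# Strongest first, in the order listed in this README.
VERIFIED_VIA = ("gemini-app", "openai-verify", "synthid-portal", "c2pa-metadata", "third-party", "none")
# Assumption: only vendor oracles are treated as pixel-level ground truth.
ORACLE_LEVELS = frozenset({"gemini-app", "openai-verify", "synthid-portal"})


def verification_rank(level: str) -> int:
    """Lower rank means stronger ground truth; unknown levels sort last."""
    try:
        return VERIFIED_VIA.index(level)
    except ValueError:
        return len(VERIFIED_VIA)


def oracle_verified(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep rows labeled by a vendor oracle, strongest level first."""
    kept = [r for r in rows if r.get("verified_via") in ORACLE_LEVELS]
    return sorted(kept, key=lambda r: verification_rank(r["verified_via"]))
```

Rows labeled only by `c2pa-metadata` are dropped here on purpose, since that level describes the manifest rather than the pixels.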
## What to collect

For the **codebook** (per target resolution, e.g. 1024x1024, 1024x1536, 1536x2816):

- 30-50+ SynthID-positive outputs per resolution (more is better; ~150-200 per
  resolution materially improves carrier discovery).
- At each target resolution, also a batch of **pure-black (#000000)** and
  **pure-white (#FFFFFF)** fills generated by the SynthID model — these isolate
  the content-independent carrier (the watermark is most of the signal there).

For the **regression set**:

- A handful of `pos` images, their `cleaned` counterparts (run through our
  pipeline), and the cleaned re-verified via `gemini-app` (should read negative).
- `neg` controls: non-AI photos and outputs from non-SynthID models (SD,
  Midjourney, Firefly) verified negative.

Avoid personal or identifiable content; the corpus stays local.
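To see how far the corpus is from the codebook guideline, one could count positives per target resolution from the manifest rows. This is only a sketch: `codebook_gaps` and `MIN_POSITIVES` are hypothetical, the threshold of 30 is the lower bound of the range quoted above, and targets are assumed to be `(width, height)` pairs matching the `width` and `height` columns.

```python
from __future__ import annotations

from collections import Counter

MIN_POSITIVES = 30  # lower bound of the 30-50+ guideline; an assumption, tune as needed


def codebook_gaps(rows: list[dict[str, str]], targets: list[tuple[int, int]]) -> dict[str, int]:
    """Map "WxH" to the number of positives still needed at each target resolution."""
    have = Counter(
        (int(r["width"]), int(r["height"])) for r in rows if r["label"] == "pos"
    )
    return {
        f"{w}x{h}": max(0, MIN_POSITIVES - have[(w, h)])
        for (w, h) in targets
    }
```

Exact-match counting follows the note that carrier frequencies are resolution-dependent, so near-miss resolutions are deliberately not pooled.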
## Ingesting

Use `scripts/synthid_corpus.py` — it copies a file in, records its sha256,
resolution, format, and C2PA issuer (via our own detector), and appends a row
to `manifest.csv`:

```bash
uv run python scripts/synthid_corpus.py ingest path/to/*.png \
    --label pos --source "Gemini app" --model gemini-3-pro \
    --verified-via gemini-app --notes "1024x1024 batch"

uv run python scripts/synthid_corpus.py status   # counts by label / resolution / verification
```
@@ -0,0 +1,4 @@
sha256,filename,label,source,model,width,height,format,c2pa_issuer,synthid_metadata,verified_via,added,notes
4ef377bde1a1d4eff141972841938643b173f5052992a018b9a21b31ac31731e,"4ef377bd-ChatGPT Image May 23, 2026, 02_43_02 PM.png",pos,ChatGPT,gpt-image,1254,1254,png,OpenAI,yes,openai-verify,2026-05-23T21:48:12Z,fresh post-rollout 2026-05-23; openai.com/verify: SynthID+C2PA detected
d09f84c0e4c6d8b336bf4a9a7277314e940dcb5052ae7051e785cbb3bb42d656,d09f84c0-Gemini_Generated_Image_vq7wkwvq7wkwvq7w.png,pos,Gemini app,gemini,2816,1536,png,Google LLC,yes,c2pa-metadata,2026-05-23T21:52:40Z,"user: latest Gemini, SynthID v2"
47188e88f956291bd38ab6906e5f21eb273d4a697ddc8b4479deac9f48915e1a,47188e88-disco_synthid_removed.png,cleaned,our pipeline (invisible/SDXL),stabilityai/stable-diffusion-xl-base-1.0,1254,1254,png,,,openai-verify,2026-05-23T22:06:54Z,cleaned from 4ef377bd disco; openai.com/verify: SynthID NOT detected (defeated)
@@ -0,0 +1,211 @@
"""Ingest and inspect the local SynthID reference corpus.

Copies images into ``data/synthid_corpus/images/<label>/`` and records one row
per image in ``manifest.csv`` (sha256, resolution, format, C2PA issuer, and the
external verification level). Dogfoods the project's own C2PA detector so the
recorded metadata matches what the library sees.

See ``data/synthid_corpus/README.md`` for the collection protocol.

Usage:
    uv run python scripts/synthid_corpus.py ingest IMAGES... --label pos \\
        --source "Gemini app" --model gemini-3-pro --verified-via gemini-app
    uv run python scripts/synthid_corpus.py status
"""

from __future__ import annotations

import csv
import hashlib
import logging
import shutil
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path

import click
from PIL import Image
from rich.console import Console
from rich.table import Table

from remove_ai_watermarks.noai.c2pa import extract_c2pa_info

log = logging.getLogger(__name__)
console = Console()

DEFAULT_ROOT = Path(__file__).resolve().parent.parent / "data" / "synthid_corpus"
LABELS = ("pos", "neg", "cleaned")
VERIFIED_VIA = ("gemini-app", "openai-verify", "synthid-portal", "c2pa-metadata", "third-party", "none")
FIELDNAMES = [
    "sha256",
    "filename",
    "label",
    "source",
    "model",
    "width",
    "height",
    "format",
    "c2pa_issuer",
    "synthid_metadata",
    "verified_via",
    "added",
    "notes",
]


def _manifest_path(root: Path) -> Path:
    return root / "manifest.csv"


def _read_manifest(root: Path) -> list[dict[str, str]]:
    path = _manifest_path(root)
    if not path.exists():
        return []
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def _sha256(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def _probe(path: Path) -> tuple[int, int, str, str, bool]:
    """Return (width, height, format, c2pa_issuer, synthid_metadata)."""
    width = height = 0
    fmt = path.suffix.lstrip(".").lower()
    try:
        with Image.open(path) as img:
            width, height = img.size
            fmt = (img.format or fmt).lower()
    except Exception as exc:  # unknown/user formats can raise non-OSError; see CLAUDE.md
        log.debug("PIL could not open %s: %s", path, exc)

    info = extract_c2pa_info(path)
    issuer = str(info.get("issuer", ""))
    synthid = "synthid_watermark" in info
    return width, height, fmt, issuer, synthid


@click.group()
def cli() -> None:
    """Manage the local SynthID reference corpus."""
    logging.basicConfig(level=logging.INFO, format="%(message)s")


@cli.command()
@click.argument("images", nargs=-1, required=True, type=click.Path(exists=True, path_type=Path))
@click.option("--label", required=True, type=click.Choice(LABELS), help="SynthID label.")
@click.option("--source", default="", help="Where the image came from (free text).")
@click.option("--model", default="", help="Generating model, e.g. gemini-3-pro.")
@click.option(
    "--verified-via", "verified_via", default="none", type=click.Choice(VERIFIED_VIA), help="Ground-truth oracle."
)
@click.option("--notes", default="", help="Free-text notes (e.g. resolution batch).")
@click.option("--root", type=click.Path(path_type=Path), default=DEFAULT_ROOT, help="Corpus root.")
def ingest(
    images: tuple[Path, ...],
    label: str,
    source: str,
    model: str,
    verified_via: str,
    notes: str,
    root: Path,
) -> None:
    """Copy IMAGES into the corpus and append rows to the manifest."""
    dest_dir = root / "images" / label
    dest_dir.mkdir(parents=True, exist_ok=True)

    rows = _read_manifest(root)
    seen = {r["sha256"] for r in rows}
    added = 0
    new_rows: list[dict[str, str]] = []

    for src in images:
        digest = _sha256(src)
        if digest in seen:
            console.print(f" [dim]skip (duplicate sha256): {src.name}[/]")
            continue
        seen.add(digest)

        width, height, fmt, issuer, synthid = _probe(src)
        stored_name = f"{digest[:8]}-{src.name}"
        shutil.copy2(src, dest_dir / stored_name)

        new_rows.append(
            {
                "sha256": digest,
                "filename": stored_name,
                "label": label,
                "source": source,
                "model": model,
                "width": str(width),
                "height": str(height),
                "format": fmt,
                "c2pa_issuer": issuer,
                "synthid_metadata": "yes" if synthid else "",
                "verified_via": verified_via,
                "added": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
                "notes": notes,
            }
        )
        added += 1
        flag = " [yellow](C2PA-SynthID)[/]" if synthid else ""
        console.print(f" [green]+[/] {label}/{stored_name} {width}x{height} {fmt}{flag}")

    if new_rows:
        path = _manifest_path(root)
        write_header = not path.exists()
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
            if write_header:
                writer.writeheader()
            writer.writerows(new_rows)

    console.print(f"\n Ingested [bold]{added}[/] image(s); skipped {len(images) - added} duplicate(s).")


@cli.command()
@click.option("--root", type=click.Path(path_type=Path), default=DEFAULT_ROOT, help="Corpus root.")
def status(root: Path) -> None:
    """Print corpus counts by label, resolution, and verification level."""
    rows = _read_manifest(root)
    if not rows:
        console.print(" [dim]Corpus is empty.[/]")
        return

    by_label = Counter(r["label"] for r in rows)
    by_verified = Counter(r["verified_via"] for r in rows)
    by_res = Counter(f"{r['label']} {r['width']}x{r['height']}" for r in rows)

    console.print(f"\n [bold]{len(rows)}[/] image(s) in {root}")

    t = Table(title="By label", show_header=True, header_style="bold")
    t.add_column("Label")
    t.add_column("Count", justify="right")
    for k in LABELS:
        if by_label.get(k):
            t.add_row(k, str(by_label[k]))
    console.print(t)

    t = Table(title="By label x resolution", show_header=True, header_style="bold")
    t.add_column("Label / resolution")
    t.add_column("Count", justify="right")
    for k, v in sorted(by_res.items()):
        t.add_row(k, str(v))
    console.print(t)

    t = Table(title="By verification", show_header=True, header_style="bold")
    t.add_column("verified_via")
    t.add_column("Count", justify="right")
    for k, v in by_verified.most_common():
        t.add_row(k, str(v))
    console.print(t)


if __name__ == "__main__":
    cli()
@@ -310,6 +310,8 @@ def cmd_metadata(
     if has_ai:
         console.print(f" [yellow]⚠[/] AI metadata detected in {source.name}:")
         meta = get_ai_metadata(source)
+        if synthid := meta.get("synthid_watermark"):
+            console.print(f" [bold yellow]⚠ SynthID pixel watermark {synthid}[/]")
         table = Table(show_header=True, header_style="bold")
         table.add_column("Key", style="cyan")
         table.add_column("Value")
@@ -142,80 +142,45 @@ def has_ai_metadata(image_path: Path) -> bool:
     return any(marker in data for marker in IPTC_AI_MARKERS)


-def _scan_png_c2pa_chunk(image_path: Path) -> dict[str, str]:
-    """Extract a human-readable summary of the C2PA manifest in a PNG file.
+def synthid_source(image_path: Path) -> str | None:
+    """Return the vendor name(s) if the image carries a SynthID pixel watermark.

-    PIL does not expose the caBX JUMBF box via ``img.info``, so we delegate
-    chunk extraction to the existing ``extract_c2pa_chunk`` helper and pull
-    key fields from the JUMBF payload without a full CBOR parser.
+    This is a *metadata-based* proxy: Google (Imagen/Gemini) and OpenAI
+    (ChatGPT/DALL-E/gpt-image) embed an invisible SynthID watermark alongside
+    a C2PA manifest, so a C2PA manifest signed by one of them on AI-generated
+    content implies SynthID in the pixels. Adobe Firefly / Microsoft Designer
+    sign C2PA but do not use SynthID, so they return None.
+
+    The verdict is reliable only while the C2PA manifest is intact -- absence
+    is not proof, because C2PA can be stripped while the pixel watermark
+    survives, and the pixel watermark itself is not locally detectable
+    (proprietary decoder).
+
+    Args:
+        image_path: Path to the image (PNG, JPEG, WebP, or ISOBMFF container).
+
+    Returns:
+        Comma-joined vendor name(s) (e.g. ``"OpenAI"``) or None.
     """
-    import re
+    from remove_ai_watermarks.noai.c2pa import extract_c2pa_info, synthid_vendors_in

-    from remove_ai_watermarks.noai.c2pa import extract_c2pa_chunk
+    # PNG: the caBX chunk parser gives a clean, structured issuer.
+    vendors = extract_c2pa_info(image_path).get("synthid_vendors")
+    if vendors:
+        return ", ".join(vendors)

-    raw = extract_c2pa_chunk(image_path)
-    if raw is None:
-        return {}
-
-    # extract_c2pa_chunk returns chunk_header (8 bytes) + data + crc (4 bytes).
-    payload = raw[8:-4]
-    result: dict[str, str] = {"c2pa_manifest": f"C2PA manifest ({len(payload)} bytes)"}
-
-    def _cbor_text_after(key: bytes) -> str | None:
-        """Return the CBOR text-string immediately following ``key``.
-
-        Handles CBOR major-type 3 length prefixes: direct (0x60-0x77),
-        1-byte (0x78 NN), and 2-byte (0x79 NN NN).
-        """
-        idx = payload.find(key)
-        if idx < 0:
-            return None
-        p = idx + len(key)
-        if p >= len(payload):
-            return None
-        head = payload[p]
-        if 0x60 <= head <= 0x77:
-            length, start = head - 0x60, p + 1
-        elif head == 0x78 and p + 1 < len(payload):
-            length, start = payload[p + 1], p + 2
-        elif head == 0x79 and p + 2 < len(payload):
-            length, start = (payload[p + 1] << 8) | payload[p + 2], p + 3
-        else:
-            return None
-        raw_str = payload[start : start + length]
-        try:
-            return raw_str.decode("utf-8")
-        except UnicodeDecodeError:
-            return raw_str.decode("latin1", errors="replace")
-
-    if generator := _cbor_text_after(b"name"):
-        result["claim_generator"] = generator
-
-    if spec := _cbor_text_after(b"specVersion"):
-        result["c2pa_spec"] = spec
-
-    dst_match = re.search(
-        rb"(http://cv\.iptc\.org/newscodes/digitalsourcetype/[A-Za-z0-9_-]+)",
-        payload,
-    )
-    if dst_match:
-        result["digital_source_type"] = dst_match.group(1).decode("latin1")
-
-    actions = sorted(
-        {m.decode("latin1") for m in re.findall(rb"c2pa\.(created|converted|edited|opened|placed)", payload)}
-    )
-    if actions:
-        result["c2pa_actions"] = ", ".join(actions)
-
-    # Scan cert DN printable strings for the signer org name.
-    signer_match = re.search(
-        rb"([A-Za-z][A-Za-z0-9 .,&'()\-]{2,48}OpenAI[A-Za-z0-9 .,&'()\-]{0,48})",
-        payload,
-    )
-    if signer_match:
-        result["signer"] = signer_match.group(1).decode("latin1").strip()
-
-    return result
+    # Non-PNG containers (JPEG APP11, WebP, AVIF/HEIF/JXL uuid box) keep the
+    # C2PA manifest where the PNG parser can't reach it. Binary-scan for the
+    # same signal: a C2PA manifest from a SynthID-using issuer on AI content.
+    with open(image_path, "rb") as f:
+        data = f.read(1024 * 1024)
+    has_c2pa = b"c2pa" in data.lower() or C2PA_UUID in data
+    # Matches both "trainedAlgorithmicMedia" and "compositeWithTrainedAlgorithmicMedia".
+    ai_source = b"trainedAlgorithmicMedia" in data or b"TrainedAlgorithmicMedia" in data
+    if not (has_c2pa and ai_source):
+        return None
+    matched = synthid_vendors_in(data)
+    return ", ".join(matched) if matched else None


 def get_ai_metadata(image_path: Path) -> dict[str, str]:
@@ -229,6 +194,8 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]:
|
||||
"""
|
||||
from PIL import Image
|
||||
|
||||
from remove_ai_watermarks.noai.c2pa import extract_c2pa_info, synthid_verdict
|
||||
|
||||
result: dict[str, str] = {}
|
||||
|
||||
with Image.open(image_path) as img:
|
||||
@@ -241,7 +208,24 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]:
                 else:
                     result[key] = str(value)

-    result.update(_scan_png_c2pa_chunk(image_path))
+    # C2PA manifest fields from the single canonical parser (noai/c2pa.py).
+    c2pa = extract_c2pa_info(image_path)
+    for key in (
+        "c2pa_manifest",
+        "claim_generator",
+        "c2pa_spec",
+        "issuer",
+        "source_type",
+        "actions",
+        "synthid_watermark",
+    ):
+        if key in c2pa:
+            result.setdefault(key, str(c2pa[key]))
+
+    # Non-PNG containers (JPEG/WebP/AVIF): extract_c2pa_info is PNG-only, so
+    # fall back to the format-agnostic source check for the SynthID verdict.
+    if "synthid_watermark" not in result and (vendor := synthid_source(image_path)):
+        result.setdefault("synthid_watermark", synthid_verdict(vendor))
     return result
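The format-agnostic check above reduces to three byte-level tests on the head of the file: a C2PA marker, an AI digital-source marker, and a SynthID-using issuer. The sketch below illustrates that rule in isolation. It is not the project's code: `ISSUERS`, `SYNTHID_ISSUERS`, `AI_MARKERS`, and `synthid_vendors` are hypothetical names, and the issuer table is a stand-in for the real mapping in `noai/constants.py`.

```python
from __future__ import annotations

# Hypothetical issuer table for illustration only; the display name for
# Adobe is an assumption, not taken from the project.
ISSUERS = {b"Google": "Google LLC", b"OpenAI": "OpenAI", b"Adobe": "Adobe"}
SYNTHID_ISSUERS = frozenset({b"Google", b"OpenAI"})
# The second spelling covers "compositeWithTrainedAlgorithmicMedia".
AI_MARKERS = (b"trainedAlgorithmicMedia", b"TrainedAlgorithmicMedia")


def synthid_vendors(blob: bytes) -> list[str]:
    """Names of SynthID-using issuers on an AI-marked C2PA manifest in ``blob``."""
    has_c2pa = b"c2pa" in blob.lower()
    ai_source = any(marker in blob for marker in AI_MARKERS)
    if not (has_c2pa and ai_source):
        return []
    # The issuer, not the mere presence of a manifest, is the signal.
    return sorted(
        name for sig, name in ISSUERS.items() if sig in blob and sig in SYNTHID_ISSUERS
    )
```

Substring matching on raw bytes is loose by design: a file that merely mentions an issuer name next to the other two markers would also match, so the result is best read as a heuristic rather than a verified signature check.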
@@ -29,6 +29,7 @@ from remove_ai_watermarks.noai.constants import (
     C2PA_ISSUERS,
     C2PA_SIGNATURES,
     PNG_SIGNATURE,
+    SYNTHID_C2PA_ISSUERS,
 )


@@ -129,10 +130,56 @@ def extract_c2pa_info(image_path: Path) -> dict[str, Any]:
     return c2pa_info


+def _cbor_text_after(payload: bytes, key: bytes) -> str | None:
+    """Return the CBOR text-string immediately following ``key`` in ``payload``.
+
+    Handles CBOR major-type 3 length prefixes: direct (0x60-0x77), 1-byte
+    (0x78 NN), and 2-byte (0x79 NN NN). This reads the actual encoded value, so
+    it avoids the byte-grabbing artifacts a loose regex produces (e.g. the
+    leading length byte showing up as ``fGPT-4o``).
+    """
+    idx = payload.find(key)
+    if idx < 0:
+        return None
+    p = idx + len(key)
+    if p >= len(payload):
+        return None
+    head = payload[p]
+    if 0x60 <= head <= 0x77:
+        length, start = head - 0x60, p + 1
+    elif head == 0x78 and p + 1 < len(payload):
+        length, start = payload[p + 1], p + 2
+    elif head == 0x79 and p + 2 < len(payload):
+        length, start = (payload[p + 1] << 8) | payload[p + 2], p + 3
+    else:
+        return None
+    raw_str = payload[start : start + length]
+    try:
+        return raw_str.decode("utf-8")
+    except UnicodeDecodeError:
+        return raw_str.decode("latin1", errors="replace")
+
+
+def synthid_verdict(vendors: str) -> str:
+    """Human-readable SynthID-source verdict, shared by all callers."""
+    return f"likely present ({vendors} embeds SynthID with C2PA)"
+
+
+def synthid_vendors_in(buffer: bytes) -> list[str]:
+    """Return SynthID-using C2PA issuer names whose signature appears in ``buffer``.
+
+    Shared by the PNG caBX parser and the format-agnostic binary scan so both
+    apply the same SYNTHID_C2PA_ISSUERS rule against their respective bytes.
+    """
+    return sorted({name for sig, name in C2PA_ISSUERS.items() if sig in buffer and sig in SYNTHID_C2PA_ISSUERS})
+
+
 def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
     """Parse C2PA chunk data and populate info dictionary."""
+    c2pa_info["c2pa_manifest"] = f"C2PA manifest ({len(chunk_data)} bytes)"
+
     # Find issuers
-    issuers = []
+    issuers: list[str] = []
     for sig, name in C2PA_ISSUERS.items():
         if sig in chunk_data:
             issuers.append(name)
@@ -140,44 +187,22 @@ def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
         c2pa_info["issuer"] = ", ".join(set(issuers))

     # Find AI tools
-    ai_tools = []
+    ai_tools: list[str] = []
     for sig, name in C2PA_AI_TOOLS.items():
         if sig in chunk_data:
             ai_tools.append(name)
     if ai_tools:
         c2pa_info["ai_tool"] = ", ".join(set(ai_tools))

-    # Extract software agent (multiple patterns)
-    patterns = [
-        rb"softwareAgent.*?dname([^\x00]+?)(?:q|l|m|n)",
-        rb"software_agent[^\x00]*?([A-Za-z0-9_\-\.]+)",
-        rb"Software[^\x00]*?([A-Za-z0-9_\-\. ]+)",
-    ]
-    for pattern in patterns:
-        match = re.search(pattern, chunk_data, re.DOTALL | re.IGNORECASE)
-        if match:
-            agent = match.group(1).decode("utf-8", errors="ignore").strip()
-            if agent and len(agent) < 100:
-                c2pa_info["software_agent"] = agent
-                break
-
-    # Extract claim generator (multiple patterns)
-    claim_patterns = [
-        rb"claim_generator[^\x00]*?([A-Za-z0-9_\-\.\/\:]+)",
-        rb"claimGenerator[^\x00]*?([A-Za-z0-9_\-\.\/\:]+)",
-        rb"dname([^\x00]{3,50})(?:q|l|m|n|i)",
-    ]
-    for pattern in claim_patterns:
-        match = re.search(pattern, chunk_data, re.DOTALL | re.IGNORECASE)
-        if match:
-            gen_name = match.group(1).decode("utf-8", errors="ignore").strip()
-            # Filter out common false positives
-            if gen_name and len(gen_name) < 100 and not gen_name.startswith(("\\x", "\\\\x")):
-                c2pa_info["claim_generator"] = gen_name
-                break
+    # Claim generator and spec version: read the CBOR text-string values
+    # directly (regex byte-grabbing produced artifacts like ``fGPT-4o``).
+    if generator := _cbor_text_after(chunk_data, b"name"):
+        c2pa_info["claim_generator"] = generator
+    if spec := _cbor_text_after(chunk_data, b"specVersion"):
+        c2pa_info["c2pa_spec"] = spec

     # Find actions
-    actions = []
+    actions: list[str] = []
     for sig, name in C2PA_ACTIONS.items():
         if sig in chunk_data:
             actions.append(name)
@@ -192,12 +217,23 @@ def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
         c2pa_info["timestamps"] = [t.decode("utf-8") for t in timestamp_matches[:3]]

     # Find digital source type
+    ai_source = False
     if b"trainedAlgorithmicMedia" in chunk_data:
         c2pa_info["source_type"] = "trainedAlgorithmicMedia (AI-generated)"
+        ai_source = True
     elif b"algorithmicMedia" in chunk_data:
         c2pa_info["source_type"] = "algorithmicMedia"
     elif b"compositeWithTrainedAlgorithmicMedia" in chunk_data:
         c2pa_info["source_type"] = "compositeWithTrainedAlgorithmicMedia (AI-enhanced)"
+        ai_source = True
+
+    # SynthID pixel-watermark proxy: a C2PA manifest from a SynthID-using
+    # vendor (Google/OpenAI) on AI-generated content implies an invisible
+    # SynthID watermark in the pixels (see SYNTHID_C2PA_ISSUERS).
+    synthid_vendors = synthid_vendors_in(chunk_data)
+    if synthid_vendors and ai_source:
+        c2pa_info["synthid_vendors"] = synthid_vendors
+        c2pa_info["synthid_watermark"] = synthid_verdict(", ".join(synthid_vendors))


 def extract_c2pa_chunk(image_path: Path) -> bytes | None:
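The length-prefix handling in `_cbor_text_after` follows the CBOR text-string encoding (major type 3). The round-trip sketch below shows why a loose regex picked up a stray `f`: a 6-byte string gets the head byte 0x60 + 6 = 0x66, which is ASCII `f`. The helper names `encode_cbor_text` and `decode_cbor_text` are hypothetical and exist only for this illustration; unlike the project function, the decoder here rejects a declared length that runs past the buffer.

```python
from __future__ import annotations


def encode_cbor_text(s: str) -> bytes:
    """Encode ``s`` as a CBOR text string (major type 3), up to 65535 bytes."""
    raw = s.encode("utf-8")
    n = len(raw)
    if n < 24:
        head = bytes([0x60 + n])  # length packed into the head byte
    elif n < 0x100:
        head = bytes([0x78, n])  # one length byte follows
    elif n < 0x10000:
        head = bytes([0x79]) + n.to_bytes(2, "big")  # two big-endian length bytes
    else:
        raise ValueError("longer strings are out of scope for this sketch")
    return head + raw


def decode_cbor_text(buf: bytes, pos: int = 0) -> str | None:
    """Decode the CBOR text string at ``buf[pos]``; None if absent or truncated."""
    if pos >= len(buf):
        return None
    head = buf[pos]
    if 0x60 <= head <= 0x77:
        length, start = head - 0x60, pos + 1
    elif head == 0x78 and pos + 1 < len(buf):
        length, start = buf[pos + 1], pos + 2
    elif head == 0x79 and pos + 2 < len(buf):
        length, start = int.from_bytes(buf[pos + 1 : pos + 3], "big"), pos + 3
    else:
        return None  # not a text-string head handled here
    raw = buf[start : start + length]
    if len(raw) != length:  # declared length runs past the buffer
        return None
    return raw.decode("utf-8", errors="replace")
```

For example, `encode_cbor_text("GPT-4o")` yields the bytes `fGPT-4o`, and decoding them returns `GPT-4o` without the length byte.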
@@ -87,6 +87,27 @@ C2PA_ISSUERS = {
     b"Truepic": "Truepic",
 }

+# C2PA issuers whose signed outputs also carry an invisible SynthID pixel
+# watermark -- a metadata proxy for "SynthID is in the pixels":
+# - Google (Imagen/Gemini): embeds SynthID, long-standing (DeepMind docs).
+# - OpenAI (ChatGPT/Codex/API): pairs SynthID with C2PA since ~2026-05-20.
+#   Confirmed by OpenAI's Help Center ("C2PA and SynthID in OpenAI-generated
+#   images", updated 2026-05-21): "Images generated with ChatGPT, Codex, and
+#   our API include both C2PA metadata and SynthID watermarks." OpenAI also
+#   notes a signal may be absent if "the image was created before these
+#   signals were available" -- so OpenAI images from BEFORE the rollout carry
+#   C2PA WITHOUT SynthID (e.g. data/samples/openai-images-2/amur-leopard.png,
+#   C2PA timestamp 2026-04-22). For OpenAI the proxy is therefore "likely",
+#   not certain; the verdict string is hedged accordingly. OpenAI's own oracle
+#   is openai.com/verify (Google's is the Gemini app "Verify with SynthID").
+# The issuer byte ("OpenAI"/"Google") is verified locally against data/samples;
+# the SynthID pairing is documented behavior (Google: DeepMind; OpenAI: above).
+# Adobe Firefly and Microsoft Designer sign C2PA but do NOT use SynthID, so a
+# C2PA manifest alone is not a SynthID signal -- the issuer is. The pixel
+# watermark is not locally detectable (proprietary decoder); the C2PA companion
+# is the proxy, and only while the manifest is intact.
+SYNTHID_C2PA_ISSUERS = frozenset({b"Google", b"OpenAI"})
+
 # C2PA known AI tools
 C2PA_AI_TOOLS = {
     b"GPT-4o": "GPT-4o",
@@ -106,6 +127,8 @@ C2PA_ACTIONS = {
     b"c2pa.filtered": "filtered",
     b"c2pa.cropped": "cropped",
     b"c2pa.resized": "resized",
+    b"c2pa.opened": "opened",
+    b"c2pa.placed": "placed",
 }

 # PNG signature
@@ -4,6 +4,7 @@ from __future__ import annotations

 from pathlib import Path

+import pytest
 from PIL import Image
 from PIL.PngImagePlugin import PngInfo

@@ -12,8 +13,12 @@ from remove_ai_watermarks.metadata import (
     get_ai_metadata,
     has_ai_metadata,
     remove_ai_metadata,
+    synthid_source,
 )

+# Real, committed C2PA sample images used to ground the SynthID-source tests.
+SAMPLES_DIR = Path(__file__).resolve().parent.parent / "data" / "samples"
+
 # ── Key detection ───────────────────────────────────────────────────


@@ -144,6 +149,109 @@ class TestGetAiMetadata:
         meta = get_ai_metadata(tmp_clean_png)
         assert meta == {}

+    def test_long_value_is_truncated(self, tmp_path: Path):
+        img = Image.new("RGB", (32, 32))
+        pnginfo = PngInfo()
+        pnginfo.add_text("parameters", "x" * 300)
+        path = tmp_path / "long.png"
+        img.save(path, pnginfo=pnginfo)
+        meta = get_ai_metadata(path)
+        assert meta["parameters"].endswith("…")
+        assert len(meta["parameters"]) <= 205
+
+
+@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
+class TestGetAiMetadataRealSample:
+    """get_ai_metadata surfaces the consolidated C2PA fields on real images."""
+
+    def test_openai_sample_fields(self):
+        meta = get_ai_metadata(SAMPLES_DIR / "chatgpt-1.png")
+        assert "claim_generator" in meta
+        assert "OpenAI" in meta["issuer"]
+        assert "OpenAI" in meta["synthid_watermark"]
+        assert "trainedAlgorithmicMedia" in meta["source_type"]
+
+
+@pytest.mark.parametrize(
+    "marker",
+    [
+        b"trainedAlgorithmicMedia",
+        b"compositeSynthetic",
+        b"algorithmicMedia",
+        b"compositeWithTrainedAlgorithmicMedia",
+    ],
+)
+def test_has_ai_metadata_detects_each_iptc_marker(tmp_path: Path, marker: bytes):
+    """Each IPTC digitalSourceType AI marker in XMP triggers detection."""
+    path = tmp_path / "iptc.jpg"
+    path.write_bytes(b"\xff\xd8\xff\xe1<x:xmpmeta>" + marker + b"</x:xmpmeta>\xff\xd9")
+    assert has_ai_metadata(path)
+
+
+# ── SynthID-source detection (metadata proxy) ────────────────────────
+
+
+@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
+class TestSynthIDSource:
+    """SynthID detection via the C2PA companion manifest.
+
+    Google (Imagen/Gemini) and OpenAI (ChatGPT/DALL-E/gpt-image) pair an
+    invisible SynthID pixel watermark with a C2PA manifest. Adobe Firefly and
+    Microsoft Designer sign C2PA Content Credentials but do NOT use SynthID,
+    so the discriminating signal is the C2PA *issuer*, not the mere presence
+    of a manifest. These tests run against real, committed sample images.
+    """
+
+    def test_openai_chatgpt_is_synthid_source(self):
+        assert synthid_source(SAMPLES_DIR / "chatgpt-1.png") == "OpenAI"
+
+    def test_openai_verdict_in_get_ai_metadata(self):
+        meta = get_ai_metadata(SAMPLES_DIR / "chatgpt-1.png")
+        assert "synthid_watermark" in meta
+        assert "OpenAI" in meta["synthid_watermark"]
+
+    def test_adobe_firefly_is_not_synthid_source(self):
+        # Adobe signs C2PA (trainedAlgorithmicMedia) but embeds no SynthID.
+        assert synthid_source(SAMPLES_DIR / "firefly-1.png") is None
+        assert "synthid_watermark" not in get_ai_metadata(SAMPLES_DIR / "firefly-1.png")
+
+    def test_non_ai_image_is_not_synthid_source(self):
+        assert synthid_source(SAMPLES_DIR / "not-ai-1.jpeg") is None
+
+
+class TestSynthIDSourceNonPng:
+    """SynthID-source detection must work beyond PNG.
+
+    ChatGPT/Gemini images saved as JPEG/WebP/AVIF carry their C2PA manifest in
+    a non-PNG container (JPEG APP11, ISOBMFF uuid box), so the PNG caBX parser
+    misses them. These use synthetic byte blobs (real fixtures aren't shipped).
+    """
+
+    def _c2pa_jpeg(self, tmp_path: Path, name: str, issuer: bytes, marker: bytes = b"trainedAlgorithmicMedia") -> Path:
+        path = tmp_path / name
+        # Minimal JPEG shell with an embedded C2PA-ish blob.
+        blob = b"jumbc2pa" + issuer + b"..." + marker
+        path.write_bytes(b"\xff\xd8\xff\xe1" + blob + b"\xff\xd9")
+        return path
+
+    def test_openai_c2pa_in_jpeg(self, tmp_path: Path):
+        path = self._c2pa_jpeg(tmp_path, "chatgpt.jpg", b"OpenAI")
+        assert synthid_source(path) == "OpenAI"
+
+    def test_google_c2pa_in_jpeg(self, tmp_path: Path):
+        path = self._c2pa_jpeg(tmp_path, "gemini.jpg", b"Google")
+        assert synthid_source(path) == "Google LLC"
+
+    def test_adobe_c2pa_in_jpeg_is_none(self, tmp_path: Path):
+        # Adobe signs C2PA but embeds no SynthID.
+        path = self._c2pa_jpeg(tmp_path, "firefly.jpg", b"Adobe")
+        assert synthid_source(path) is None
+
+    def test_openai_without_ai_marker_is_none(self, tmp_path: Path):
+        # Issuer present but no AI digital-source marker -> not a SynthID source.
+        path = self._c2pa_jpeg(tmp_path, "edited.jpg", b"OpenAI", marker=b"")
+        assert synthid_source(path) is None
+

 # ── remove_ai_metadata ──────────────────────────────────────────────

@@ -2,10 +2,18 @@

from __future__ import annotations

import struct
from pathlib import Path

import pytest

from remove_ai_watermarks.noai.c2pa import (
    _cbor_text_after,
    extract_c2pa_chunk,
    extract_c2pa_info,
    has_c2pa_metadata,
    inject_c2pa_chunk,
    synthid_verdict,
)
from remove_ai_watermarks.noai.cleaner import (
    has_ai_content,
@@ -26,6 +34,10 @@ from remove_ai_watermarks.noai.extractor import (
     get_ai_metadata_summary,
     has_ai_metadata,
 )
+from remove_ai_watermarks.noai.isobmff import (
+    is_isobmff,
+    strip_c2pa_boxes,
+)

 # ── Constants ───────────────────────────────────────────────────────

@@ -128,3 +140,130 @@ class TestC2PA:

     def test_c2pa_returns_false_for_non_png(self, tmp_jpeg_path):
         assert not has_c2pa_metadata(tmp_jpeg_path)
+
+
+SAMPLES_DIR = Path(__file__).resolve().parent.parent / "data" / "samples"
+
+
+@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
+class TestC2PARealSamples:
+    """Parser behavior on real committed C2PA images."""
+
+    def test_detects_c2pa_in_openai_png(self):
+        assert has_c2pa_metadata(SAMPLES_DIR / "chatgpt-1.png")
+
+    def test_extract_info_openai_fields(self):
+        info = extract_c2pa_info(SAMPLES_DIR / "chatgpt-1.png")
+        assert info["has_c2pa"] is True
+        assert "OpenAI" in info["issuer"]
+        assert "c2pa_manifest" in info  # "C2PA manifest (N bytes)"
+        assert "trainedAlgorithmicMedia" in info["source_type"]
+        # CBOR-clean claim generator, no regex artifacts (e.g. "fGPT-4o").
+        assert info["claim_generator"]
+        assert not info["claim_generator"].startswith("f")
+        assert "synthid_watermark" in info
+
+    def test_extract_info_adobe_has_no_synthid(self):
+        info = extract_c2pa_info(SAMPLES_DIR / "firefly-1.png")
+        assert "Adobe" in info["issuer"]
+        assert "synthid_watermark" not in info
+
+    def test_extract_chunk_returns_bytes(self):
+        chunk = extract_c2pa_chunk(SAMPLES_DIR / "chatgpt-1.png")
+        assert chunk is not None
+        assert chunk[4:8] == b"caBX"  # chunk type in the 8-byte header
+
+    def test_inject_round_trip(self, tmp_clean_png, tmp_path):
+        """Extract a real C2PA chunk, inject into a clean PNG, re-detect."""
+        chunk = extract_c2pa_chunk(SAMPLES_DIR / "chatgpt-1.png")
+        out = tmp_path / "injected.png"
+        inject_c2pa_chunk(tmp_clean_png, out, chunk)
+        assert has_c2pa_metadata(out)
+        assert "OpenAI" in extract_c2pa_info(out)["issuer"]
+
+
+class TestC2PAInjectValidation:
+    def test_inject_rejects_non_png(self, tmp_path):
+        with pytest.raises(ValueError, match="only supported for PNG"):
+            inject_c2pa_chunk(tmp_path / "in.jpg", tmp_path / "out.png", b"")
+
+
+# ── CBOR text extraction (parser internals) ─────────────────────────
+
+
+class TestCborTextAfter:
+    """_cbor_text_after handles the three CBOR text-string length prefixes."""
+
+    def test_direct_length(self):
+        # major-type 3, direct length (0x60 + len). "abc" -> 0x63.
+        payload = b"name" + bytes([0x63]) + b"abc"
+        assert _cbor_text_after(payload, b"name") == "abc"
+
+    def test_one_byte_length(self):
+        s = b"x" * 30
+        payload = b"name" + bytes([0x78, 30]) + s
+        assert _cbor_text_after(payload, b"name") == "x" * 30
+
+    def test_two_byte_length(self):
+        s = b"y" * 300
+        payload = b"name" + bytes([0x79]) + struct.pack(">H", 300) + s
+        assert _cbor_text_after(payload, b"name") == "y" * 300
+
+    def test_key_not_found_returns_none(self):
+        assert _cbor_text_after(b"nothing here", b"name") is None
+
+    def test_key_at_end_returns_none(self):
+        assert _cbor_text_after(b"prefixname", b"name") is None
+
+    def test_invalid_head_returns_none(self):
+        # 0x00 is not a text-string head.
+        assert _cbor_text_after(b"name" + bytes([0x00]) + b"abc", b"name") is None
+
+    def test_latin1_fallback_on_invalid_utf8(self):
+        payload = b"name" + bytes([0x61]) + b"\xff"  # len 1, invalid utf-8
+        assert _cbor_text_after(payload, b"name") is not None
+
+
+class TestSynthIDVerdict:
+    def test_format(self):
+        assert synthid_verdict("OpenAI") == "likely present (OpenAI embeds SynthID with C2PA)"
+
+    def test_multiple_vendors(self):
+        assert "Google LLC, OpenAI" in synthid_verdict("Google LLC, OpenAI")
+
+
+# ── ISOBMFF (AVIF / HEIF / JPEG-XL container stripping) ──────────────
+
+FTYP = b"\x00\x00\x00\x18ftypavif\x00\x00\x00\x00avifmif1"  # 24-byte ftyp box
+
+
+class TestISOBMFF:
+    def test_is_isobmff_true(self):
+        assert is_isobmff(FTYP)
+
+    def test_is_isobmff_false_for_png(self):
+        assert not is_isobmff(b"\x89PNG\r\n\x1a\n\x00\x00")
+
+    def test_is_isobmff_false_for_short(self):
+        assert not is_isobmff(b"abc")
+
+    def test_strips_jpegxl_jumb_box(self):
+        """JPEG-XL stores JUMBF in a ``jumb`` box, always stripped."""
+        jumb = struct.pack(">I", 8 + 5) + b"jumb" + b"hello"
+        cleaned, stripped = strip_c2pa_boxes(FTYP + jumb)
+        assert stripped == 1
+        assert cleaned == FTYP
+
+    def test_keeps_non_c2pa_box_with_64bit_size(self):
+        """size==1 means a 64-bit largesize follows; non-C2PA box is kept."""
+        payload = b"\x00" * 8
+        box = b"\x00\x00\x00\x01" + b"free" + struct.pack(">Q", 16 + len(payload)) + payload
+        cleaned, stripped = strip_c2pa_boxes(FTYP + box)
+        assert stripped == 0
+        assert cleaned == FTYP + box
+
+    def test_malformed_box_does_not_crash(self):
+        # A box claiming size 4 (< 8-byte header) must terminate iteration safely.
+        cleaned, stripped = strip_c2pa_boxes(FTYP + b"\x00\x00\x00\x04XXXX")
+        assert stripped == 0
+        assert cleaned.startswith(FTYP)
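The ISOBMFF tests above exercise three box-header cases: a plain 32-bit size, `size == 1` with a 64-bit largesize after the type field, and a declared size smaller than the header. A minimal sketch of a top-level box walker that handles those cases follows. `iter_boxes` is a hypothetical helper written for illustration, not the project's `strip_c2pa_boxes`, whose internals are not shown in this diff.

```python
from __future__ import annotations

import struct
from collections.abc import Iterator


def iter_boxes(data: bytes) -> Iterator[tuple[bytes, int, int]]:
    """Yield (box_type, offset, total_size) for each top-level box in ``data``."""
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:  # 64-bit largesize follows the 4-byte type
            if pos + 16 > len(data):
                return
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:  # box runs to the end of the data
            size = len(data) - pos
        if size < header or pos + size > len(data):
            return  # malformed or truncated: stop instead of looping or over-reading
        yield box_type, pos, size
        pos += size
```

Stopping on the first inconsistent header keeps the walk bounded: a size below the header length would otherwise advance the cursor by too little, or not at all.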
@@ -87,6 +87,24 @@ class TestModelProfiles:
     def test_recommended_strength_medium(self):
         assert get_recommended_strength("unknown_type") == 0.35

+    @pytest.mark.parametrize("wm_type", ["stegastamp", "stegasamp", "treering", "ringid"])
+    def test_high_perturbation_watermark_types(self, wm_type):
+        """Robust spatial watermarks need aggressive (0.7) regeneration."""
+        assert get_recommended_strength(wm_type) == 0.7
+
+    @pytest.mark.parametrize("wm_type", ["stablesignature", "dwtectsvd", "rivagan", "ssl", "hidden"])
+    def test_low_perturbation_watermark_types(self, wm_type):
+        """Fragile frequency/latent watermarks break at low (0.04) strength."""
+        assert get_recommended_strength(wm_type) == 0.04
+
+    def test_strength_match_is_case_insensitive(self):
+        assert get_recommended_strength("TreeRing") == 0.7
+        assert get_recommended_strength("StableSignature") == 0.04
+
+    def test_strength_matches_substring_in_descriptive_name(self):
+        # e.g. a CLI passing "treering_v2" or "synthid-stegastamp" still maps.
+        assert get_recommended_strength("treering_v2") == 0.7
+

 # ── Format utilities ────────────────────────────────────────────────

@@ -0,0 +1,66 @@
+"""Tests for the SynthID corpus ingestion script."""
+
+from __future__ import annotations
+
+import csv
+import sys
+from pathlib import Path
+
+import pytest
+from click.testing import CliRunner
+
+# scripts/ is not an installed package; add it to the path for import.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "scripts"))
+
+import synthid_corpus
+
+SAMPLES_DIR = Path(__file__).resolve().parent.parent / "data" / "samples"
+
+
+def _manifest_rows(root: Path) -> list[dict[str, str]]:
+    with open(root / "manifest.csv", newline="") as f:
+        return list(csv.DictReader(f))
+
+
+@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
+class TestIngest:
+    def test_ingest_openai_flags_synthid_metadata(self, tmp_path: Path):
+        runner = CliRunner()
+        result = runner.invoke(
+            synthid_corpus.cli,
+            ["ingest", str(SAMPLES_DIR / "chatgpt-1.png"), "--label", "pos", "--root", str(tmp_path)],
+        )
+        assert result.exit_code == 0, result.output
+
+        rows = _manifest_rows(tmp_path)
+        assert len(rows) == 1
+        row = rows[0]
+        assert row["label"] == "pos"
+        assert row["synthid_metadata"] == "yes"
+        assert int(row["width"]) > 0
+        assert int(row["height"]) > 0
+        # The copied file lands under images/pos/ with a sha-prefixed name.
+        assert (tmp_path / "images" / "pos" / row["filename"]).exists()
+
+    def test_ingest_firefly_not_flagged(self, tmp_path: Path):
+        runner = CliRunner()
+        runner.invoke(
+            synthid_corpus.cli,
+            ["ingest", str(SAMPLES_DIR / "firefly-1.png"), "--label", "neg", "--root", str(tmp_path)],
+        )
+        rows = _manifest_rows(tmp_path)
+        assert len(rows) == 1
+        assert rows[0]["synthid_metadata"] == ""  # Adobe signs C2PA but not SynthID
+
+    def test_ingest_dedupes_by_sha256(self, tmp_path: Path):
+        runner = CliRunner()
+        args = ["ingest", str(SAMPLES_DIR / "chatgpt-1.png"), "--label", "pos", "--root", str(tmp_path)]
+        runner.invoke(synthid_corpus.cli, args)
+        runner.invoke(synthid_corpus.cli, args)  # second time: duplicate
+        assert len(_manifest_rows(tmp_path)) == 1
+
+    def test_status_on_empty_corpus(self, tmp_path: Path):
+        runner = CliRunner()
+        result = runner.invoke(synthid_corpus.cli, ["status", "--root", str(tmp_path)])
+        assert result.exit_code == 0
+        assert "empty" in result.output.lower()