feat(metadata): SynthID-source detection, C2PA parser consolidation, corpus + tests

Detect SynthID-bearing images via their C2PA companion: a manifest signed by a
SynthID-using vendor (Google/OpenAI) on AI-generated content implies an
invisible SynthID pixel watermark. Verified end-to-end against the vendor
oracles (openai.com/verify, Gemini "Verify with SynthID").

- metadata: synthid_source() + synthid_watermark verdict in get_ai_metadata,
  surfaced as a `metadata --check` callout. Format-agnostic (PNG caBX parser +
  JPEG/WebP/AVIF/HEIF/JXL binary scan).
- constants: SYNTHID_C2PA_ISSUERS {Google, OpenAI}; +opened/placed actions.
- c2pa: single CBOR-aware parser (_cbor_text_after) replaces glitchy regex
  (fixes fGPT-4o claim_generator); removed duplicate _scan_png_c2pa_chunk from
  metadata; shared synthid_verdict / synthid_vendors_in helpers.
- corpus: scripts/synthid_corpus.py ingest tool + data/synthid_corpus/
  (manifest tracked, images gitignored) for a labeled reference set.
- tests: +38 across C2PA parser internals, extract/inject round-trip, ISOBMFF
  container stripping, all IPTC AI markers, and invisible watermark strength
  tiers (SynthID/StableSignature/TreeRing/StegaStamp/RingID/RivaGAN/...).

Pixel-level SynthID detection remains out of reach locally (Google's decoder is
proprietary); a from-scratch spectral pilot confirmed it does not separate real
content. See CLAUDE.md for the full evaluation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test-user
2026-05-24 11:32:46 -07:00
parent c1ff4e1cd9
commit f07ce10c72
14 changed files with 785 additions and 107 deletions
+4
@@ -21,6 +21,9 @@ Thumbs.db
# Test results
data/results/
# SynthID corpus images (manifest.csv + README.md stay tracked)
data/synthid_corpus/images/
# Reference materials
_refs/
@@ -30,3 +33,4 @@ yolov8n.pt
# Claude Code local settings
.claude/settings.local.json
data/synthid_corpus/refs/
+8 -4
@@ -11,6 +11,10 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r
## Test and lint
- `bash maintain.sh` — uv-outdated, uv-secure, ruff check/fix, ruff format, pyright, pytest -n auto
- `maintain.sh` does not currently finish green (pre-existing, not per-change): `uv-secure` aborts on a fixable transitive `idna` vuln, and strict pyright carries debt in `remove_ai_metadata` / `cli.py` (untyped piexif/PIL/click/rich). To gate a change, run `uv run ruff check`, `uv run pyright <changed files>`, `uv run pytest` directly.
- Run `uv run` from the repo root — from another cwd it falls back to a bare env without numpy/cv2/torch.
- Metadata/C2PA tests assert against real committed fixtures in `data/samples/` (`chatgpt-*.png` = OpenAI C2PA, `firefly-1.png` = Adobe, `not-ai-*` = clean); synthetic byte blobs cover the JPEG/ISOBMFF format paths.
- SynthID reference corpus: `scripts/synthid_corpus.py` ingests labeled images into `data/synthid_corpus/` (`manifest.csv` tracked, `images/` gitignored); see its README for the collection protocol and verification oracles.
## Configuration
@@ -21,14 +25,14 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r
- `noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor.
- `noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, C2PA_ISSUERS, and `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI). Add a new issuer here, not inline.
- `metadata.py`: `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None; `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout.
- `metadata.py`: `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout.
- `face_protector.py` — YOLO detect + soft-blend pattern; mirror this for any "protect region during diffusion" features
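The three-signal rule behind the non-PNG binary scan (C2PA marker + SynthID issuer + AI-source marker) can be sketched in isolation. This is a minimal illustration, not the library's code: the function name `scan_for_synthid_source`, the byte signatures in `ISSUER_SIGNATURES`, and the `SYNTHID_ISSUERS` set are assumptions invented for the example; the real tables live in `noai/constants.py`.

```python
from __future__ import annotations

# Hypothetical stand-ins for the tables in noai/constants.py (assumed values).
ISSUER_SIGNATURES = {b"Google": "Google", b"OpenAI": "OpenAI", b"Adobe": "Adobe"}
SYNTHID_ISSUERS = {"Google", "OpenAI"}


def scan_for_synthid_source(data: bytes) -> str | None:
    """Return comma-joined SynthID vendor names, or None if any signal is missing."""
    lowered = data.lower()
    has_c2pa = b"c2pa" in lowered
    # Lowercased so "compositeWithTrainedAlgorithmicMedia" matches as well.
    ai_source = b"trainedalgorithmicmedia" in lowered
    if not (has_c2pa and ai_source):
        return None
    vendors = sorted(
        name
        for sig, name in ISSUER_SIGNATURES.items()
        if sig in data and name in SYNTHID_ISSUERS
    )
    return ", ".join(vendors) if vendors else None
```

All three signals must be present: a C2PA manifest from a non-SynthID issuer, or an issuer string without a C2PA marker, yields `None`.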
## Known limitations
- `invisible` pipeline downscales to model-native resolution (1024 px for SDXL) before diffusion. Degrades fine text in infographics. Tracked; fix is tile-based diffusion.
- Pyright first run is slow (2-3 min) due to ML deps (torch/diffusers/transformers stubs)
- Pyright first run is slow (2-3 min) due to ML deps (torch/diffusers/transformers stubs); full-project `uv run pyright` can stall for many minutes — scope it to changed files.
- `ultralytics` monkey-patches `PIL.Image.open` and tries to autoload `pi_heif`. When `pi_heif` is missing, opening files raises `ModuleNotFoundError`, not `UnidentifiedImageError`. Code that opens user-supplied or unknown-format files should `except Exception`, not just `OSError`/`UnidentifiedImageError`.
- Metadata detection for AVIF/HEIF/JPEG-XL relies on a binary scan for `C2PA_UUID` + `IPTC_AI_MARKERS`. C2PA removal in those containers is implemented via `noai/isobmff.py` (top-level ``uuid`` / ``jumb`` box stripper, no re-encoding). EXIF/XMP boxes inside those containers are not yet scrubbed.
- **SynthID detection is metadata-only.** There is no reliable *local* detector of the SynthID *pixel* watermark — Google's decoder is proprietary, no public spec or API (only a waitlisted portal). We detect SynthID by its C2PA companion (`synthid_source` / `SYNTHID_C2PA_ISSUERS`), which is reliable while the manifest is intact but says nothing once C2PA is stripped. Google→SynthID is long-standing; OpenAI→SynthID is confirmed by OpenAI's Help Center (ChatGPT/Codex/API "include both C2PA metadata and SynthID watermarks", updated 2026-05-21) but time-gated (pre-rollout OpenAI images carry C2PA without SynthID), so the OpenAI verdict is hedged "likely". Oracles: Gemini app "Verify with SynthID" (Google), openai.com/verify (OpenAI). The spectral phase-coherence approach from `github.com/aloshdenny/reverse-SynthID` was evaluated (May 2026) and **does not work for real-content detection**: on its own shipped codebook + validation set, watermarked and cleaned images were indistinguishable (conf within noise, cleaned often higher); it only fires on pure-black 1024x1024 reference images at exact resolution (the controlled case it was calibrated on). The README's "90% / conf=0.91" reproduces only in that lab condition. Do not build a production detector on it; if revisited, it is experimental/diagnostic only and needs a per-resolution, per-model reference corpus.
- **SynthID v2 vs default pipeline:** the SDXL-based default profile (since May 2026) defeats SynthID v2. **Verified end-to-end (May 2026):** local SDXL run on a Gemini 3 Pro output, checked via the Gemini app's "Verify with SynthID" feature, returned "no SynthID watermark detected". The same configuration is used in raiw-app production (`fal-ai/fast-sdxl` at native ~1024 px, strength 0.05, steps 50). SD-1.5 dreamshaper at 768 px was previously the default and does NOT defeat v2 — verified empirically against the same feature (strength 0.04, 0.10, and elastic warp α∈{5,8} all flagged positive). That SD-1.5 path was removed; only `default` (SDXL) and `ctrlregen` profiles remain.
- **SynthID detection is metadata-only.** There is no reliable *local* detector of the SynthID *pixel* watermark — Google's decoder is proprietary, no public spec or API (only a waitlisted portal). We detect SynthID by its C2PA companion (`synthid_source` / `SYNTHID_C2PA_ISSUERS`), which is reliable while the manifest is intact but says nothing once C2PA is stripped. Google→SynthID is long-standing; OpenAI→SynthID is confirmed by OpenAI's Help Center (ChatGPT/Codex/API "include both C2PA metadata and SynthID watermarks", updated 2026-05-21) but time-gated (pre-rollout OpenAI images carry C2PA without SynthID), so the OpenAI verdict is hedged "likely". Oracles: Gemini app "Verify with SynthID" (Google), openai.com/verify (OpenAI). The spectral phase-coherence approach from `github.com/aloshdenny/reverse-SynthID` was evaluated (May 2026) and **does not work for real-content detection**: on its own shipped codebook + validation set, watermarked and cleaned images were indistinguishable (conf within noise, cleaned often higher); it only fires on pure-black 1024x1024 reference images at exact resolution (the controlled case it was calibrated on). The README's "90% / conf=0.91" reproduces only in that lab condition. Do not build a production detector on it; if revisited, it is experimental/diagnostic only and needs a per-resolution, per-model reference corpus. A from-scratch gpt-image pilot (2026-05-24) confirmed this independently: 5 independent solid-black gpt-image outputs share a near-identical fixed signature (pairwise residual correlation **0.92**, avg-template retains 97% energy), so the watermark/carrier IS strongly present and consistent on flat content — but the carrier frequencies extracted from it do NOT discriminate real content (carrier-to-random ratio: cleaned 1.86 > watermarked 1.53; a non-gpt-image image scored highest at 3.67). The signature drowns in content texture. 
Net: a perfectly consistent solid-color signature still yields no real-content pixel detector with magnitude/carrier methods.
- **SynthID v2 vs default pipeline:** the SDXL-based default profile (since May 2026) defeats SynthID v2. **Verified end-to-end (May 2026):** local SDXL run on a Gemini 3 Pro output, checked via the Gemini app's "Verify with SynthID" feature, returned "no SynthID watermark detected". Also confirmed against **OpenAI's** SynthID (2026-05-23): a fresh ChatGPT/gpt-image output read "SynthID detected" on openai.com/verify before the local SDXL run and "SynthID not detected" after (corpus regression chain: pos `4ef377bd` -> cleaned `47188e88`). The same configuration is used in raiw-app production (`fal-ai/fast-sdxl` at native ~1024 px, strength 0.05, steps 50). SD-1.5 dreamshaper at 768 px was previously the default and does NOT defeat v2 — verified empirically against the same feature (strength 0.04, 0.10, and elastic warp α∈{5,8} all flagged positive). That SD-1.5 path was removed; only `default` (SDXL) and `ctrlregen` profiles remain.
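
A pairwise correlation figure like the one quoted for the solid-black pilot can be computed along the following lines. This is only a sketch under stated assumptions: how the residuals were extracted is not described here, so the inputs are hypothetical flat lists of residual values, and `pearson` / `mean_pairwise_correlation` are illustrative names rather than project functions.

```python
from __future__ import annotations

import math


def pearson(a: list[float], b: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("constant input has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


def mean_pairwise_correlation(residuals: list[list[float]]) -> float:
    """Average Pearson correlation over every unordered pair of residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    pairs = [(i, j) for i in range(len(residuals)) for j in range(i + 1, len(residuals))]
    return sum(pearson(residuals[i], residuals[j]) for i, j in pairs) / len(pairs)
```

A high value on flat fills only shows that the signature is consistent there; as noted above, it does not make the carrier separable on real content.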
+5 -1
@@ -187,6 +187,10 @@ remove-ai-watermarks visible image.png -o clean.png
remove-ai-watermarks invisible image.png -o clean.png --humanize 4.0
# Check / strip AI metadata (C2PA, EXIF, "Made with AI" labels)
# --check also flags SynthID-bearing sources: a C2PA manifest signed by
# Google or OpenAI implies an invisible SynthID watermark in the pixels
# (both vendors pair the two). Adobe Firefly / Microsoft sign C2PA without
# SynthID, so they are reported as C2PA only.
remove-ai-watermarks metadata image.png --check
remove-ai-watermarks metadata image.png --remove
@@ -253,7 +257,7 @@ pip install certifi
Tracked but not yet implemented:
- **SynthID-Image v2 automated regression test**. The default SDXL profile defeats v2 per manual checks against the [Gemini app](https://support.google.com/gemini/answer/16722517)'s "Verify with SynthID" feature on a Gemini 3 Pro output (May 2026). An automated end-to-end test would need either programmatic access to the [SynthID Detector portal](https://blog.google/innovation-and-ai/products/google-synthid-ai-content-detector/) (waitlist for media professionals and researchers) or an offline surrogate detector. Open.
- **SynthID-Image v2 automated regression test**. The default SDXL profile defeats v2 per manual checks against the [Gemini app](https://support.google.com/gemini/answer/16722517)'s "Verify with SynthID" feature on a Gemini 3 Pro output (May 2026). An automated end-to-end test would need either programmatic access to the [SynthID Detector portal](https://blog.google/innovation-and-ai/products/google-synthid-ai-content-detector/) (waitlist for media professionals and researchers) or an offline surrogate detector. The spectral phase-coherence surrogate from [reverse-SynthID](https://github.com/aloshdenny/reverse-SynthID) was evaluated and does not separate watermarked from cleaned real-content images (it only fires on controlled solid-color references at exact resolution), so it is not a usable oracle. Open.
- **AVIF / HEIF / JPEG-XL detection limits**. Removal strips top-level C2PA `uuid` and JUMBF `jumb` boxes. EXIF/XMP boxes inside these containers are not yet scrubbed (PNG and JPEG are fully covered).
- **Video pipeline (`noai-video`)**: per-frame inpainting and tracking for Sora 2 dynamic logo, Veo 3.1 badge, Kling, Runway. Separate package, not folded into this repo.
+75
@@ -0,0 +1,75 @@
# SynthID reference corpus
A locally-collected, labeled image corpus for SynthID work. Two downstream uses:
1. **Per-resolution spectral codebook** for an experimental SynthID detector
(carrier frequencies are resolution-dependent, so labels must record the
exact native resolution).
2. **Removal regression set** — verify that our pipeline turns a SynthID-positive
image into a negative one.
There is no reliable local detector of the SynthID pixel watermark (Google's
decoder is proprietary). The ground-truth label therefore comes from an
external oracle, recorded per image in `verified_via` (see below).
## Layout
```
data/synthid_corpus/
  README.md        # this protocol (committed)
  manifest.csv     # labels + provenance (committed, reviewable)
  images/          # the actual files (gitignored, local-only)
    pos/           # SynthID present
    neg/           # SynthID absent
    cleaned/       # our pipeline output from a pos image
```
Images are gitignored on purpose: the corpus is large, may contain personal or
licensed content, and SynthID-positive outputs are best kept local. The
`manifest.csv` (sha256 + labels + extracted metadata) is the durable artifact.
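
Because the images are local-only, it can help to check that a local `images/` tree still matches the tracked manifest. The sketch below assumes files are stored as `images/<label>/<filename>`, as the ingest script does; `check_corpus` and `sha256_of` are hypothetical helper names, not part of the repository.

```python
from __future__ import annotations

import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large images are not read at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_corpus(root: Path) -> dict[str, list[str]]:
    """Compare manifest.csv rows with the files under images/<label>/."""
    report: dict[str, list[str]] = {"ok": [], "missing": [], "mismatch": []}
    with open(root / "manifest.csv", newline="") as f:
        for row in csv.DictReader(f):
            path = root / "images" / row["label"] / row["filename"]
            if not path.exists():
                report["missing"].append(row["filename"])
            elif sha256_of(path) != row["sha256"]:
                report["mismatch"].append(row["filename"])
            else:
                report["ok"].append(row["filename"])
    return report
```

A `missing` entry is expected on a fresh clone, since only the manifest is committed; a `mismatch` means the local file was modified after ingest.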
## Verification levels (`verified_via`)
Ground-truth quality, strongest first:
- `gemini-app` — checked via the Gemini app "Verify with SynthID" feature. Gold standard for the pixel watermark (Google models).
- `openai-verify` — checked via openai.com/verify (gold standard for OpenAI ChatGPT/Codex/API images).
- `synthid-portal` — checked via Google's SynthID Detector portal.
- `c2pa-metadata` — issuer-only proxy (Google/OpenAI C2PA manifest present). Weaker: the C2PA can be stripped while the pixel watermark remains.
- `third-party` — label asserted by an external dataset, not independently verified.
- `none` — unverified.
Prefer `gemini-app` for any image that will train the codebook or gate a test.
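
The ordering above can be turned into a simple filter for manifest rows. This is a minimal sketch: the tuple copies the list order from this section, while the helper name `at_least` and the choice to treat unknown values as unverified are assumptions made for illustration.

```python
from __future__ import annotations

# Strongest first, copied from the list in this section.
VERIFIED_VIA = ("gemini-app", "openai-verify", "synthid-portal", "c2pa-metadata", "third-party", "none")
RANK = {name: i for i, name in enumerate(VERIFIED_VIA)}


def at_least(row: dict[str, str], minimum: str) -> bool:
    """True if the row's verification is as strong as `minimum` or stronger."""
    # Missing or unrecognized values are ranked like "none" (assumption).
    level = RANK.get(row.get("verified_via", "none"), RANK["none"])
    return level <= RANK[minimum]
```

For example, `at_least(row, "synthid-portal")` keeps only rows checked against one of the three external oracles and drops issuer-only or third-party labels.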
## What to collect
For the **codebook** (per target resolution, e.g. 1024x1024, 1024x1536, 1536x2816):
- 30-50+ SynthID-positive outputs per resolution (more is better; ~150-200 per
resolution materially improves carrier discovery).
- At each target resolution, also a batch of **pure-black (#000000)** and
**pure-white (#FFFFFF)** fills generated by the SynthID model — these isolate
the content-independent carrier (the watermark is most of the signal there).
For the **regression set**:
- A handful of `pos` images, their `cleaned` counterparts (run through our
pipeline), and the cleaned re-verified via `gemini-app` (should read negative).
- `neg` controls: non-AI photos and outputs from non-SynthID models (SD,
Midjourney, Firefly) verified negative.
Avoid personal or identifiable content; the corpus stays local.
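
Progress toward the per-resolution codebook target can be tracked from manifest rows. The sketch below is illustrative only: `codebook_gaps` is a hypothetical helper, the default of 30 is simply the lower bound of the guidance above, and resolutions with no positives at all do not appear because nothing in the manifest names them.

```python
from __future__ import annotations

from collections import Counter

TARGET_PER_RESOLUTION = 30  # lower bound of the 30-50+ guidance (assumed default)


def codebook_gaps(rows: list[dict[str, str]], target: int = TARGET_PER_RESOLUTION) -> dict[str, int]:
    """Map each under-filled resolution to the number of `pos` images still needed."""
    counts = Counter((r["width"], r["height"]) for r in rows if r["label"] == "pos")
    return {f"{w}x{h}": target - n for (w, h), n in sorted(counts.items()) if n < target}
```

Rows are the dictionaries produced by `csv.DictReader` over `manifest.csv`, so `width` and `height` are strings.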
## Ingesting
Use `scripts/synthid_corpus.py` — it copies a file in, records its sha256,
resolution, format, and C2PA issuer (via our own detector), and appends a row
to `manifest.csv`:
```bash
uv run python scripts/synthid_corpus.py ingest path/to/*.png \
    --label pos --source "Gemini app" --model gemini-3-pro \
    --verified-via gemini-app --notes "1024x1024 batch"
uv run python scripts/synthid_corpus.py status # counts by label / resolution / verification
```
+4
@@ -0,0 +1,4 @@
sha256,filename,label,source,model,width,height,format,c2pa_issuer,synthid_metadata,verified_via,added,notes
4ef377bde1a1d4eff141972841938643b173f5052992a018b9a21b31ac31731e,"4ef377bd-ChatGPT Image May 23, 2026, 02_43_02 PM.png",pos,ChatGPT,gpt-image,1254,1254,png,OpenAI,yes,openai-verify,2026-05-23T21:48:12Z,fresh post-rollout 2026-05-23; openai.com/verify: SynthID+C2PA detected
d09f84c0e4c6d8b336bf4a9a7277314e940dcb5052ae7051e785cbb3bb42d656,d09f84c0-Gemini_Generated_Image_vq7wkwvq7wkwvq7w.png,pos,Gemini app,gemini,2816,1536,png,Google LLC,yes,c2pa-metadata,2026-05-23T21:52:40Z,"user: latest Gemini, SynthID v2"
47188e88f956291bd38ab6906e5f21eb273d4a697ddc8b4479deac9f48915e1a,47188e88-disco_synthid_removed.png,cleaned,our pipeline (invisible/SDXL),stabilityai/stable-diffusion-xl-base-1.0,1254,1254,png,,,openai-verify,2026-05-23T22:06:54Z,cleaned from 4ef377bd disco; openai.com/verify: SynthID NOT detected (defeated)
+211
@@ -0,0 +1,211 @@
"""Ingest and inspect the local SynthID reference corpus.

Copies images into ``data/synthid_corpus/images/<label>/`` and records one row
per image in ``manifest.csv`` (sha256, resolution, format, C2PA issuer, and the
external verification level). Dogfoods the project's own C2PA detector so the
recorded metadata matches what the library sees.

See ``data/synthid_corpus/README.md`` for the collection protocol.

Usage:
    uv run python scripts/synthid_corpus.py ingest IMAGES... --label pos \\
        --source "Gemini app" --model gemini-3-pro --verified-via gemini-app
    uv run python scripts/synthid_corpus.py status
"""

from __future__ import annotations

import csv
import hashlib
import logging
import shutil
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path

import click
from PIL import Image
from rich.console import Console
from rich.table import Table

from remove_ai_watermarks.noai.c2pa import extract_c2pa_info

log = logging.getLogger(__name__)
console = Console()

DEFAULT_ROOT = Path(__file__).resolve().parent.parent / "data" / "synthid_corpus"
LABELS = ("pos", "neg", "cleaned")
VERIFIED_VIA = ("gemini-app", "openai-verify", "synthid-portal", "c2pa-metadata", "third-party", "none")
FIELDNAMES = [
    "sha256",
    "filename",
    "label",
    "source",
    "model",
    "width",
    "height",
    "format",
    "c2pa_issuer",
    "synthid_metadata",
    "verified_via",
    "added",
    "notes",
]


def _manifest_path(root: Path) -> Path:
    return root / "manifest.csv"


def _read_manifest(root: Path) -> list[dict[str, str]]:
    path = _manifest_path(root)
    if not path.exists():
        return []
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def _sha256(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def _probe(path: Path) -> tuple[int, int, str, str, bool]:
    """Return (width, height, format, c2pa_issuer, synthid_metadata)."""
    width = height = 0
    fmt = path.suffix.lstrip(".").lower()
    try:
        with Image.open(path) as img:
            width, height = img.size
            fmt = (img.format or fmt).lower()
    except Exception as exc:  # unknown/user formats can raise non-OSError; see CLAUDE.md
        log.debug("PIL could not open %s: %s", path, exc)
    info = extract_c2pa_info(path)
    issuer = str(info.get("issuer", ""))
    synthid = "synthid_watermark" in info
    return width, height, fmt, issuer, synthid


@click.group()
def cli() -> None:
    """Manage the local SynthID reference corpus."""
    logging.basicConfig(level=logging.INFO, format="%(message)s")


@cli.command()
@click.argument("images", nargs=-1, required=True, type=click.Path(exists=True, path_type=Path))
@click.option("--label", required=True, type=click.Choice(LABELS), help="SynthID label.")
@click.option("--source", default="", help="Where the image came from (free text).")
@click.option("--model", default="", help="Generating model, e.g. gemini-3-pro.")
@click.option(
    "--verified-via", "verified_via", default="none", type=click.Choice(VERIFIED_VIA), help="Ground-truth oracle."
)
@click.option("--notes", default="", help="Free-text notes (e.g. resolution batch).")
@click.option("--root", type=click.Path(path_type=Path), default=DEFAULT_ROOT, help="Corpus root.")
def ingest(
    images: tuple[Path, ...],
    label: str,
    source: str,
    model: str,
    verified_via: str,
    notes: str,
    root: Path,
) -> None:
    """Copy IMAGES into the corpus and append rows to the manifest."""
    dest_dir = root / "images" / label
    dest_dir.mkdir(parents=True, exist_ok=True)
    rows = _read_manifest(root)
    seen = {r["sha256"] for r in rows}

    added = 0
    new_rows: list[dict[str, str]] = []
    for src in images:
        digest = _sha256(src)
        if digest in seen:
            console.print(f" [dim]skip (duplicate sha256): {src.name}[/]")
            continue
        seen.add(digest)

        width, height, fmt, issuer, synthid = _probe(src)
        stored_name = f"{digest[:8]}-{src.name}"
        shutil.copy2(src, dest_dir / stored_name)

        new_rows.append(
            {
                "sha256": digest,
                "filename": stored_name,
                "label": label,
                "source": source,
                "model": model,
                "width": str(width),
                "height": str(height),
                "format": fmt,
                "c2pa_issuer": issuer,
                "synthid_metadata": "yes" if synthid else "",
                "verified_via": verified_via,
                "added": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
                "notes": notes,
            }
        )
        added += 1
        flag = " [yellow](C2PA-SynthID)[/]" if synthid else ""
        console.print(f" [green]+[/] {label}/{stored_name} {width}x{height} {fmt}{flag}")

    if new_rows:
        path = _manifest_path(root)
        write_header = not path.exists()
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
            if write_header:
                writer.writeheader()
            writer.writerows(new_rows)

    console.print(f"\n Ingested [bold]{added}[/] image(s); skipped {len(images) - added} duplicate(s).")


@cli.command()
@click.option("--root", type=click.Path(path_type=Path), default=DEFAULT_ROOT, help="Corpus root.")
def status(root: Path) -> None:
    """Print corpus counts by label, resolution, and verification level."""
    rows = _read_manifest(root)
    if not rows:
        console.print(" [dim]Corpus is empty.[/]")
        return

    by_label = Counter(r["label"] for r in rows)
    by_verified = Counter(r["verified_via"] for r in rows)
    by_res = Counter(f"{r['label']} {r['width']}x{r['height']}" for r in rows)

    console.print(f"\n [bold]{len(rows)}[/] image(s) in {root}")

    t = Table(title="By label", show_header=True, header_style="bold")
    t.add_column("Label")
    t.add_column("Count", justify="right")
    for k in LABELS:
        if by_label.get(k):
            t.add_row(k, str(by_label[k]))
    console.print(t)

    t = Table(title="By label x resolution", show_header=True, header_style="bold")
    t.add_column("Label / resolution")
    t.add_column("Count", justify="right")
    for k, v in sorted(by_res.items()):
        t.add_row(k, str(v))
    console.print(t)

    t = Table(title="By verification", show_header=True, header_style="bold")
    t.add_column("verified_via")
    t.add_column("Count", justify="right")
    for k, v in by_verified.most_common():
        t.add_row(k, str(v))
    console.print(t)


if __name__ == "__main__":
    cli()
+2
@@ -310,6 +310,8 @@ def cmd_metadata(
if has_ai:
console.print(f" [yellow]⚠[/] AI metadata detected in {source.name}:")
meta = get_ai_metadata(source)
if synthid := meta.get("synthid_watermark"):
console.print(f" [bold yellow]⚠ SynthID pixel watermark {synthid}[/]")
table = Table(show_header=True, header_style="bold")
table.add_column("Key", style="cyan")
table.add_column("Value")
+55 -71
@@ -142,80 +142,45 @@ def has_ai_metadata(image_path: Path) -> bool:
return any(marker in data for marker in IPTC_AI_MARKERS)
def _scan_png_c2pa_chunk(image_path: Path) -> dict[str, str]:
"""Extract a human-readable summary of the C2PA manifest in a PNG file.
def synthid_source(image_path: Path) -> str | None:
"""Return the vendor name(s) if the image carries a SynthID pixel watermark.
PIL does not expose the caBX JUMBF box via ``img.info``, so we delegate
chunk extraction to the existing ``extract_c2pa_chunk`` helper and pull
key fields from the JUMBF payload without a full CBOR parser.
This is a *metadata-based* proxy: Google (Imagen/Gemini) and OpenAI
(ChatGPT/DALL-E/gpt-image) embed an invisible SynthID watermark alongside
a C2PA manifest, so a C2PA manifest signed by one of them on AI-generated
content implies SynthID in the pixels. Adobe Firefly / Microsoft Designer
sign C2PA but do not use SynthID, so they return None.
The verdict is reliable only while the C2PA manifest is intact -- absence
is not proof, because C2PA can be stripped while the pixel watermark
survives, and the pixel watermark itself is not locally detectable
(proprietary decoder).
Args:
image_path: Path to the image (PNG, JPEG, WebP, or ISOBMFF container).
Returns:
Comma-joined vendor name(s) (e.g. ``"OpenAI"``) or None.
"""
import re
from remove_ai_watermarks.noai.c2pa import extract_c2pa_info, synthid_vendors_in
from remove_ai_watermarks.noai.c2pa import extract_c2pa_chunk
# PNG: the caBX chunk parser gives a clean, structured issuer.
vendors = extract_c2pa_info(image_path).get("synthid_vendors")
if vendors:
return ", ".join(vendors)
raw = extract_c2pa_chunk(image_path)
if raw is None:
return {}
# extract_c2pa_chunk returns chunk_header (8 bytes) + data + crc (4 bytes).
payload = raw[8:-4]
result: dict[str, str] = {"c2pa_manifest": f"C2PA manifest ({len(payload)} bytes)"}
def _cbor_text_after(key: bytes) -> str | None:
"""Return the CBOR text-string immediately following ``key``.
Handles CBOR major-type 3 length prefixes: direct (0x60-0x77),
1-byte (0x78 NN), and 2-byte (0x79 NN NN).
"""
idx = payload.find(key)
if idx < 0:
return None
p = idx + len(key)
if p >= len(payload):
return None
head = payload[p]
if 0x60 <= head <= 0x77:
length, start = head - 0x60, p + 1
elif head == 0x78 and p + 1 < len(payload):
length, start = payload[p + 1], p + 2
elif head == 0x79 and p + 2 < len(payload):
length, start = (payload[p + 1] << 8) | payload[p + 2], p + 3
else:
return None
raw_str = payload[start : start + length]
try:
return raw_str.decode("utf-8")
except UnicodeDecodeError:
return raw_str.decode("latin1", errors="replace")
if generator := _cbor_text_after(b"name"):
result["claim_generator"] = generator
if spec := _cbor_text_after(b"specVersion"):
result["c2pa_spec"] = spec
dst_match = re.search(
rb"(http://cv\.iptc\.org/newscodes/digitalsourcetype/[A-Za-z0-9_-]+)",
payload,
)
if dst_match:
result["digital_source_type"] = dst_match.group(1).decode("latin1")
actions = sorted(
{m.decode("latin1") for m in re.findall(rb"c2pa\.(created|converted|edited|opened|placed)", payload)}
)
if actions:
result["c2pa_actions"] = ", ".join(actions)
# Scan cert DN printable strings for the signer org name.
signer_match = re.search(
rb"([A-Za-z][A-Za-z0-9 .,&'()\-]{2,48}OpenAI[A-Za-z0-9 .,&'()\-]{0,48})",
payload,
)
if signer_match:
result["signer"] = signer_match.group(1).decode("latin1").strip()
return result
# Non-PNG containers (JPEG APP11, WebP, AVIF/HEIF/JXL uuid box) keep the
# C2PA manifest where the PNG parser can't reach it. Binary-scan for the
# same signal: a C2PA manifest from a SynthID-using issuer on AI content.
with open(image_path, "rb") as f:
data = f.read(1024 * 1024)
has_c2pa = b"c2pa" in data.lower() or C2PA_UUID in data
# Matches both "trainedAlgorithmicMedia" and "compositeWithTrainedAlgorithmicMedia".
ai_source = b"trainedAlgorithmicMedia" in data or b"TrainedAlgorithmicMedia" in data
if not (has_c2pa and ai_source):
return None
matched = synthid_vendors_in(data)
return ", ".join(matched) if matched else None
def get_ai_metadata(image_path: Path) -> dict[str, str]:
@@ -229,6 +194,8 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]:
"""
from PIL import Image
from remove_ai_watermarks.noai.c2pa import extract_c2pa_info, synthid_verdict
result: dict[str, str] = {}
with Image.open(image_path) as img:
@@ -241,7 +208,24 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]:
else:
result[key] = str(value)
result.update(_scan_png_c2pa_chunk(image_path))
# C2PA manifest fields from the single canonical parser (noai/c2pa.py).
c2pa = extract_c2pa_info(image_path)
for key in (
"c2pa_manifest",
"claim_generator",
"c2pa_spec",
"issuer",
"source_type",
"actions",
"synthid_watermark",
):
if key in c2pa:
result.setdefault(key, str(c2pa[key]))
# Non-PNG containers (JPEG/WebP/AVIF): extract_c2pa_info is PNG-only, so
# fall back to the format-agnostic source check for the SynthID verdict.
if "synthid_watermark" not in result and (vendor := synthid_source(image_path)):
result.setdefault("synthid_watermark", synthid_verdict(vendor))
return result
+67 -31
@@ -29,6 +29,7 @@ from remove_ai_watermarks.noai.constants import (
C2PA_ISSUERS,
C2PA_SIGNATURES,
PNG_SIGNATURE,
SYNTHID_C2PA_ISSUERS,
)
@@ -129,10 +130,56 @@ def extract_c2pa_info(image_path: Path) -> dict[str, Any]:
return c2pa_info


def _cbor_text_after(payload: bytes, key: bytes) -> str | None:
    """Return the CBOR text-string immediately following ``key`` in ``payload``.

    Handles CBOR major-type 3 length prefixes: direct (0x60-0x77), 1-byte
    (0x78 NN), and 2-byte (0x79 NN NN). This reads the actual encoded value, so
    it avoids the byte-grabbing artifacts a loose regex produces (e.g. the
    leading length byte showing up as ``fGPT-4o``).
    """
    idx = payload.find(key)
    if idx < 0:
        return None
    p = idx + len(key)
    if p >= len(payload):
        return None
    head = payload[p]
    if 0x60 <= head <= 0x77:
        length, start = head - 0x60, p + 1
    elif head == 0x78 and p + 1 < len(payload):
        length, start = payload[p + 1], p + 2
    elif head == 0x79 and p + 2 < len(payload):
        length, start = (payload[p + 1] << 8) | payload[p + 2], p + 3
    else:
        return None
    raw_str = payload[start : start + length]
    try:
        return raw_str.decode("utf-8")
    except UnicodeDecodeError:
        return raw_str.decode("latin1", errors="replace")


def synthid_verdict(vendors: str) -> str:
    """Human-readable SynthID-source verdict, shared by all callers."""
    return f"likely present ({vendors} embeds SynthID with C2PA)"


def synthid_vendors_in(buffer: bytes) -> list[str]:
    """Return SynthID-using C2PA issuer names whose signature appears in ``buffer``.

    Shared by the PNG caBX parser and the format-agnostic binary scan so both
    apply the same SYNTHID_C2PA_ISSUERS rule against their respective bytes.
    """
    return sorted({name for sig, name in C2PA_ISSUERS.items() if sig in buffer and sig in SYNTHID_C2PA_ISSUERS})


def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
"""Parse C2PA chunk data and populate info dictionary."""
c2pa_info["c2pa_manifest"] = f"C2PA manifest ({len(chunk_data)} bytes)"
# Find issuers
issuers = []
issuers: list[str] = []
for sig, name in C2PA_ISSUERS.items():
if sig in chunk_data:
issuers.append(name)
@@ -140,44 +187,22 @@ def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
        c2pa_info["issuer"] = ", ".join(set(issuers))

    # Find AI tools
    ai_tools = []
    ai_tools: list[str] = []
    for sig, name in C2PA_AI_TOOLS.items():
        if sig in chunk_data:
            ai_tools.append(name)
    if ai_tools:
        c2pa_info["ai_tool"] = ", ".join(set(ai_tools))

    # Extract software agent (multiple patterns)
    patterns = [
        rb"softwareAgent.*?dname([^\x00]+?)(?:q|l|m|n)",
        rb"software_agent[^\x00]*?([A-Za-z0-9_\-\.]+)",
        rb"Software[^\x00]*?([A-Za-z0-9_\-\. ]+)",
    ]
    for pattern in patterns:
        match = re.search(pattern, chunk_data, re.DOTALL | re.IGNORECASE)
        if match:
            agent = match.group(1).decode("utf-8", errors="ignore").strip()
            if agent and len(agent) < 100:
                c2pa_info["software_agent"] = agent
                break

    # Extract claim generator (multiple patterns)
    claim_patterns = [
        rb"claim_generator[^\x00]*?([A-Za-z0-9_\-\.\/\:]+)",
        rb"claimGenerator[^\x00]*?([A-Za-z0-9_\-\.\/\:]+)",
        rb"dname([^\x00]{3,50})(?:q|l|m|n|i)",
    ]
    for pattern in claim_patterns:
        match = re.search(pattern, chunk_data, re.DOTALL | re.IGNORECASE)
        if match:
            gen_name = match.group(1).decode("utf-8", errors="ignore").strip()
            # Filter out common false positives
            if gen_name and len(gen_name) < 100 and not gen_name.startswith(("\\x", "\\\\x")):
                c2pa_info["claim_generator"] = gen_name
                break
    # Claim generator and spec version: read the CBOR text-string values
    # directly (regex byte-grabbing produced artifacts like ``fGPT-4o``).
    if generator := _cbor_text_after(chunk_data, b"name"):
        c2pa_info["claim_generator"] = generator
    if spec := _cbor_text_after(chunk_data, b"specVersion"):
        c2pa_info["c2pa_spec"] = spec

    # Find actions
    actions = []
    actions: list[str] = []
    for sig, name in C2PA_ACTIONS.items():
        if sig in chunk_data:
            actions.append(name)
@@ -192,12 +217,23 @@ def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
        c2pa_info["timestamps"] = [t.decode("utf-8") for t in timestamp_matches[:3]]

    # Find digital source type
    ai_source = False
    if b"trainedAlgorithmicMedia" in chunk_data:
        c2pa_info["source_type"] = "trainedAlgorithmicMedia (AI-generated)"
        ai_source = True
    elif b"algorithmicMedia" in chunk_data:
        c2pa_info["source_type"] = "algorithmicMedia"
    elif b"compositeWithTrainedAlgorithmicMedia" in chunk_data:
        c2pa_info["source_type"] = "compositeWithTrainedAlgorithmicMedia (AI-enhanced)"
        ai_source = True

    # SynthID pixel-watermark proxy: a C2PA manifest from a SynthID-using
    # vendor (Google/OpenAI) on AI-generated content implies an invisible
    # SynthID watermark in the pixels (see SYNTHID_C2PA_ISSUERS).
    synthid_vendors = synthid_vendors_in(chunk_data)
    if synthid_vendors and ai_source:
        c2pa_info["synthid_vendors"] = synthid_vendors
        c2pa_info["synthid_watermark"] = synthid_verdict(", ".join(synthid_vendors))
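The proxy rule above combines two independent byte-level checks: the manifest must name a SynthID-using issuer, and it must carry an AI digital-source marker. A minimal standalone sketch of that conjunction follows. The `ISSUERS` table is an assumed three-entry subset (the real `C2PA_ISSUERS` is only partly visible in this diff), and `synthid_proxy` is a hypothetical name, not the function added by this commit.

```python
# Assumed subset of the issuer table; values mirror names asserted in the tests.
ISSUERS = {b"Google": "Google LLC", b"OpenAI": "OpenAI", b"Adobe": "Adobe"}
SYNTHID_ISSUERS = frozenset({b"Google", b"OpenAI"})
# The two source types that the parser treats as AI-generated or AI-enhanced.
AI_MARKERS = (b"trainedAlgorithmicMedia", b"compositeWithTrainedAlgorithmicMedia")


def synthid_proxy(manifest: bytes) -> list[str]:
    """Return SynthID-using issuer names, but only for AI-marked manifests."""
    if not any(marker in manifest for marker in AI_MARKERS):
        return []  # issuer alone is not enough: content must be AI-sourced
    return sorted(
        name
        for sig, name in ISSUERS.items()
        if sig in SYNTHID_ISSUERS and sig in manifest
    )


openai_ai = synthid_proxy(b"jumbc2paOpenAI...trainedAlgorithmicMedia")
adobe_ai = synthid_proxy(b"jumbc2paAdobe...trainedAlgorithmicMedia")
openai_plain = synthid_proxy(b"jumbc2paOpenAI...")
```

Only the first call returns a vendor; the Adobe manifest fails the issuer check and the unmarked OpenAI manifest fails the source-type check.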
def extract_c2pa_chunk(image_path: Path) -> bytes | None:
@@ -87,6 +87,27 @@ C2PA_ISSUERS = {
    b"Truepic": "Truepic",
}
# C2PA issuers whose signed outputs also carry an invisible SynthID pixel
# watermark -- a metadata proxy for "SynthID is in the pixels":
# - Google (Imagen/Gemini): embeds SynthID, long-standing (DeepMind docs).
# - OpenAI (ChatGPT/Codex/API): pairs SynthID with C2PA since ~2026-05-20.
# Confirmed by OpenAI's Help Center ("C2PA and SynthID in OpenAI-generated
# images", updated 2026-05-21): "Images generated with ChatGPT, Codex, and
# our API include both C2PA metadata and SynthID watermarks." OpenAI also
# notes a signal may be absent if "the image was created before these
# signals were available" -- so OpenAI images from BEFORE the rollout carry
# C2PA WITHOUT SynthID (e.g. data/samples/openai-images-2/amur-leopard.png,
# C2PA timestamp 2026-04-22). For OpenAI the proxy is therefore "likely",
# not certain; the verdict string is hedged accordingly. OpenAI's own oracle
# is openai.com/verify (Google's is the Gemini app "Verify with SynthID").
# The issuer bytes (b"OpenAI"/b"Google") are verified locally against data/samples;
# the SynthID pairing is documented behavior (Google: DeepMind; OpenAI: above).
# Adobe Firefly and Microsoft Designer sign C2PA but do NOT use SynthID, so a
# C2PA manifest alone is not a SynthID signal -- the issuer is. The pixel
# watermark is not locally detectable (proprietary decoder); the C2PA companion
# is the proxy, and only while the manifest is intact.
SYNTHID_C2PA_ISSUERS = frozenset({b"Google", b"OpenAI"})
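The comment above explains why the OpenAI verdict is only "likely": images signed before the SynthID rollout carry C2PA without the pixel watermark. One way a caller could reason about this is to compare the manifest timestamp with the rollout date. The sketch below is purely illustrative and not something this commit implements: `OPENAI_SYNTHID_ROLLOUT` uses the approximate date from the comment as an assumption, and `openai_synthid_plausible` is a hypothetical helper.

```python
from datetime import date

# Assumption: approximate rollout date taken from the comment ("~2026-05-20").
OPENAI_SYNTHID_ROLLOUT = date(2026, 5, 20)


def openai_synthid_plausible(c2pa_timestamp: str) -> bool:
    """True if an ISO-8601 C2PA timestamp is on or after the assumed rollout."""
    # Only the leading YYYY-MM-DD part is parsed, so any time suffix is ignored.
    signed_on = date.fromisoformat(c2pa_timestamp[:10])
    return signed_on >= OPENAI_SYNTHID_ROLLOUT


before_rollout = openai_synthid_plausible("2026-04-22T09:15:00Z")
after_rollout = openai_synthid_plausible("2026-05-24T11:32:46Z")
```

Because the rollout date is approximate, a check like this could only soften or strengthen the hedge; it would not replace the vendor oracles.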
# C2PA known AI tools
C2PA_AI_TOOLS = {
    b"GPT-4o": "GPT-4o",
@@ -106,6 +127,8 @@ C2PA_ACTIONS = {
    b"c2pa.filtered": "filtered",
    b"c2pa.cropped": "cropped",
    b"c2pa.resized": "resized",
    b"c2pa.opened": "opened",
    b"c2pa.placed": "placed",
}
# PNG signature
+108
@@ -4,6 +4,7 @@ from __future__ import annotations
from pathlib import Path
import pytest
from PIL import Image
from PIL.PngImagePlugin import PngInfo
@@ -12,8 +13,12 @@ from remove_ai_watermarks.metadata import (
    get_ai_metadata,
    has_ai_metadata,
    remove_ai_metadata,
    synthid_source,
)
# Real, committed C2PA sample images used to ground the SynthID-source tests.
SAMPLES_DIR = Path(__file__).resolve().parent.parent / "data" / "samples"
# ── Key detection ───────────────────────────────────────────────────
@@ -144,6 +149,109 @@ class TestGetAiMetadata:
        meta = get_ai_metadata(tmp_clean_png)
        assert meta == {}

    def test_long_value_is_truncated(self, tmp_path: Path):
        img = Image.new("RGB", (32, 32))
        pnginfo = PngInfo()
        pnginfo.add_text("parameters", "x" * 300)
        path = tmp_path / "long.png"
        img.save(path, pnginfo=pnginfo)
        meta = get_ai_metadata(path)
        assert meta["parameters"].endswith("")
        assert len(meta["parameters"]) <= 205
@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
class TestGetAiMetadataRealSample:
    """get_ai_metadata surfaces the consolidated C2PA fields on real images."""

    def test_openai_sample_fields(self):
        meta = get_ai_metadata(SAMPLES_DIR / "chatgpt-1.png")
        assert "claim_generator" in meta
        assert "OpenAI" in meta["issuer"]
        assert "OpenAI" in meta["synthid_watermark"]
        assert "trainedAlgorithmicMedia" in meta["source_type"]


@pytest.mark.parametrize(
    "marker",
    [
        b"trainedAlgorithmicMedia",
        b"compositeSynthetic",
        b"algorithmicMedia",
        b"compositeWithTrainedAlgorithmicMedia",
    ],
)
def test_has_ai_metadata_detects_each_iptc_marker(tmp_path: Path, marker: bytes):
    """Each IPTC digitalSourceType AI marker in XMP triggers detection."""
    path = tmp_path / "iptc.jpg"
    path.write_bytes(b"\xff\xd8\xff\xe1<x:xmpmeta>" + marker + b"</x:xmpmeta>\xff\xd9")
    assert has_ai_metadata(path)
# ── SynthID-source detection (metadata proxy) ────────────────────────
@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
class TestSynthIDSource:
    """SynthID detection via the C2PA companion manifest.

    Google (Imagen/Gemini) and OpenAI (ChatGPT/DALL-E/gpt-image) pair an
    invisible SynthID pixel watermark with a C2PA manifest. Adobe Firefly and
    Microsoft Designer sign C2PA Content Credentials but do NOT use SynthID,
    so the discriminating signal is the C2PA *issuer*, not the mere presence
    of a manifest. These tests run against real, committed sample images.
    """

    def test_openai_chatgpt_is_synthid_source(self):
        assert synthid_source(SAMPLES_DIR / "chatgpt-1.png") == "OpenAI"

    def test_openai_verdict_in_get_ai_metadata(self):
        meta = get_ai_metadata(SAMPLES_DIR / "chatgpt-1.png")
        assert "synthid_watermark" in meta
        assert "OpenAI" in meta["synthid_watermark"]

    def test_adobe_firefly_is_not_synthid_source(self):
        # Adobe signs C2PA (trainedAlgorithmicMedia) but embeds no SynthID.
        assert synthid_source(SAMPLES_DIR / "firefly-1.png") is None
        assert "synthid_watermark" not in get_ai_metadata(SAMPLES_DIR / "firefly-1.png")

    def test_non_ai_image_is_not_synthid_source(self):
        assert synthid_source(SAMPLES_DIR / "not-ai-1.jpeg") is None
class TestSynthIDSourceNonPng:
    """SynthID-source detection must work beyond PNG.

    ChatGPT/Gemini images saved as JPEG/WebP/AVIF carry their C2PA manifest in
    a non-PNG container (JPEG APP11, ISOBMFF uuid box), so the PNG caBX parser
    misses them. These use synthetic byte blobs (real fixtures aren't shipped).
    """

    def _c2pa_jpeg(self, tmp_path: Path, name: str, issuer: bytes, marker: bytes = b"trainedAlgorithmicMedia") -> Path:
        path = tmp_path / name
        # Minimal JPEG shell with an embedded C2PA-ish blob.
        blob = b"jumbc2pa" + issuer + b"..." + marker
        path.write_bytes(b"\xff\xd8\xff\xe1" + blob + b"\xff\xd9")
        return path

    def test_openai_c2pa_in_jpeg(self, tmp_path: Path):
        path = self._c2pa_jpeg(tmp_path, "chatgpt.jpg", b"OpenAI")
        assert synthid_source(path) == "OpenAI"

    def test_google_c2pa_in_jpeg(self, tmp_path: Path):
        path = self._c2pa_jpeg(tmp_path, "gemini.jpg", b"Google")
        assert synthid_source(path) == "Google LLC"

    def test_adobe_c2pa_in_jpeg_is_none(self, tmp_path: Path):
        # Adobe signs C2PA but embeds no SynthID.
        path = self._c2pa_jpeg(tmp_path, "firefly.jpg", b"Adobe")
        assert synthid_source(path) is None

    def test_openai_without_ai_marker_is_none(self, tmp_path: Path):
        # Issuer present but no AI digital-source marker -> not a SynthID source.
        path = self._c2pa_jpeg(tmp_path, "edited.jpg", b"OpenAI", marker=b"")
        assert synthid_source(path) is None
# ── remove_ai_metadata ──────────────────────────────────────────────
+139
@@ -2,10 +2,18 @@
from __future__ import annotations
import struct
from pathlib import Path
import pytest
from remove_ai_watermarks.noai.c2pa import (
    _cbor_text_after,
    extract_c2pa_chunk,
    extract_c2pa_info,
    has_c2pa_metadata,
    inject_c2pa_chunk,
    synthid_verdict,
)
from remove_ai_watermarks.noai.cleaner import (
    has_ai_content,
@@ -26,6 +34,10 @@ from remove_ai_watermarks.noai.extractor import (
    get_ai_metadata_summary,
    has_ai_metadata,
)
from remove_ai_watermarks.noai.isobmff import (
    is_isobmff,
    strip_c2pa_boxes,
)
# ── Constants ───────────────────────────────────────────────────────
@@ -128,3 +140,130 @@ class TestC2PA:
    def test_c2pa_returns_false_for_non_png(self, tmp_jpeg_path):
        assert not has_c2pa_metadata(tmp_jpeg_path)
SAMPLES_DIR = Path(__file__).resolve().parent.parent / "data" / "samples"


@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
class TestC2PARealSamples:
    """Parser behavior on real committed C2PA images."""

    def test_detects_c2pa_in_openai_png(self):
        assert has_c2pa_metadata(SAMPLES_DIR / "chatgpt-1.png")

    def test_extract_info_openai_fields(self):
        info = extract_c2pa_info(SAMPLES_DIR / "chatgpt-1.png")
        assert info["has_c2pa"] is True
        assert "OpenAI" in info["issuer"]
        assert "c2pa_manifest" in info  # "C2PA manifest (N bytes)"
        assert "trainedAlgorithmicMedia" in info["source_type"]
        # CBOR-clean claim generator, no regex artifacts (e.g. "fGPT-4o").
        assert info["claim_generator"]
        assert not info["claim_generator"].startswith("f")
        assert "synthid_watermark" in info

    def test_extract_info_adobe_has_no_synthid(self):
        info = extract_c2pa_info(SAMPLES_DIR / "firefly-1.png")
        assert "Adobe" in info["issuer"]
        assert "synthid_watermark" not in info

    def test_extract_chunk_returns_bytes(self):
        chunk = extract_c2pa_chunk(SAMPLES_DIR / "chatgpt-1.png")
        assert chunk is not None
        assert chunk[4:8] == b"caBX"  # chunk type in the 8-byte header

    def test_inject_round_trip(self, tmp_clean_png, tmp_path):
        """Extract a real C2PA chunk, inject into a clean PNG, re-detect."""
        chunk = extract_c2pa_chunk(SAMPLES_DIR / "chatgpt-1.png")
        out = tmp_path / "injected.png"
        inject_c2pa_chunk(tmp_clean_png, out, chunk)
        assert has_c2pa_metadata(out)
        assert "OpenAI" in extract_c2pa_info(out)["issuer"]
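The `chunk[4:8] == b"caBX"` assertion above relies on the PNG chunk layout: a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC. The following is a minimal sketch of locating such a chunk in an in-memory PNG. `make_chunk` and `find_chunk` are hypothetical helpers for illustration and make no claim about how `extract_c2pa_chunk` is implemented.

```python
import struct
import zlib
from typing import Optional

PNG_SIG = b"\x89PNG\r\n\x1a\n"


def make_chunk(ctype: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)


def find_chunk(png: bytes, wanted: bytes) -> Optional[bytes]:
    """Return the full raw chunk (header, data, CRC) of the first matching type."""
    if not png.startswith(PNG_SIG):
        return None
    pos = len(PNG_SIG)
    while pos + 12 <= len(png):  # 12 = length + type + CRC with empty data
        (length,) = struct.unpack(">I", png[pos : pos + 4])
        ctype = png[pos + 4 : pos + 8]
        end = pos + 12 + length
        if end > len(png):  # truncated chunk: stop rather than read past the end
            return None
        if ctype == wanted:
            return png[pos:end]
        pos = end
    return None


# Synthetic PNG skeleton; the IHDR payload is placeholder bytes, not a real header.
png = (
    PNG_SIG
    + make_chunk(b"IHDR", b"\x00" * 13)
    + make_chunk(b"caBX", b"jumbc2pa")
    + make_chunk(b"IEND", b"")
)
chunk = find_chunk(png, b"caBX")
```

Returning the raw chunk including its header is what makes the type visible at offset 4, matching the indexing used in the test.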
class TestC2PAInjectValidation:
    def test_inject_rejects_non_png(self, tmp_path):
        with pytest.raises(ValueError, match="only supported for PNG"):
            inject_c2pa_chunk(tmp_path / "in.jpg", tmp_path / "out.png", b"")


# ── CBOR text extraction (parser internals) ─────────────────────────


class TestCborTextAfter:
    """_cbor_text_after handles the three CBOR text-string length prefixes."""

    def test_direct_length(self):
        # major-type 3, direct length (0x60 + len). "abc" -> 0x63.
        payload = b"name" + bytes([0x63]) + b"abc"
        assert _cbor_text_after(payload, b"name") == "abc"

    def test_one_byte_length(self):
        s = b"x" * 30
        payload = b"name" + bytes([0x78, 30]) + s
        assert _cbor_text_after(payload, b"name") == "x" * 30

    def test_two_byte_length(self):
        s = b"y" * 300
        payload = b"name" + bytes([0x79]) + struct.pack(">H", 300) + s
        assert _cbor_text_after(payload, b"name") == "y" * 300

    def test_key_not_found_returns_none(self):
        assert _cbor_text_after(b"nothing here", b"name") is None

    def test_key_at_end_returns_none(self):
        assert _cbor_text_after(b"prefixname", b"name") is None

    def test_invalid_head_returns_none(self):
        # 0x00 is not a text-string head.
        assert _cbor_text_after(b"name" + bytes([0x00]) + b"abc", b"name") is None

    def test_latin1_fallback_on_invalid_utf8(self):
        payload = b"name" + bytes([0x61]) + b"\xff"  # len 1, invalid utf-8
        assert _cbor_text_after(payload, b"name") is not None


class TestSynthIDVerdict:
    def test_format(self):
        assert synthid_verdict("OpenAI") == "likely present (OpenAI embeds SynthID with C2PA)"

    def test_multiple_vendors(self):
        assert "Google LLC, OpenAI" in synthid_verdict("Google LLC, OpenAI")
# ── ISOBMFF (AVIF / HEIF / JPEG-XL container stripping) ──────────────
FTYP = b"\x00\x00\x00\x18ftypavif\x00\x00\x00\x00avifmif1" # 24-byte ftyp box
class TestISOBMFF:
    def test_is_isobmff_true(self):
        assert is_isobmff(FTYP)

    def test_is_isobmff_false_for_png(self):
        assert not is_isobmff(b"\x89PNG\r\n\x1a\n\x00\x00")

    def test_is_isobmff_false_for_short(self):
        assert not is_isobmff(b"abc")

    def test_strips_jpegxl_jumb_box(self):
        """JPEG-XL stores JUMBF in a ``jumb`` box, always stripped."""
        jumb = struct.pack(">I", 8 + 5) + b"jumb" + b"hello"
        cleaned, stripped = strip_c2pa_boxes(FTYP + jumb)
        assert stripped == 1
        assert cleaned == FTYP

    def test_keeps_non_c2pa_box_with_64bit_size(self):
        """size==1 means a 64-bit largesize follows; non-C2PA box is kept."""
        payload = b"\x00" * 8
        box = b"\x00\x00\x00\x01" + b"free" + struct.pack(">Q", 16 + len(payload)) + payload
        cleaned, stripped = strip_c2pa_boxes(FTYP + box)
        assert stripped == 0
        assert cleaned == FTYP + box

    def test_malformed_box_does_not_crash(self):
        # A box claiming size 4 (< 8-byte header) must terminate iteration safely.
        cleaned, stripped = strip_c2pa_boxes(FTYP + b"\x00\x00\x00\x04XXXX")
        assert stripped == 0
        assert cleaned.startswith(FTYP)
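The three stripping tests above exercise the ISOBMFF box header rules: a 32-bit size followed by a 4-byte type, `size == 1` meaning a 64-bit `largesize` follows, and a size smaller than the header being malformed. A rough sketch of a box walker that satisfies the same cases is shown below. `iter_boxes` and `strip_boxes` are hypothetical names; the sketch drops only `jumb` boxes and does not model the real `strip_c2pa_boxes`, which per the docstrings in this commit also handles C2PA `uuid` boxes.

```python
import struct
from typing import Iterator


def iter_boxes(data: bytes) -> Iterator[tuple[bytes, int, int]]:
    """Yield (type, start, end) for each top-level box; stop on malformed input."""
    pos = 0
    while pos + 8 <= len(data):
        (size,) = struct.unpack(">I", data[pos : pos + 4])
        btype = data[pos + 4 : pos + 8]
        header = 8
        if size == 1:  # 64-bit largesize follows the type field
            if pos + 16 > len(data):
                return
            (size,) = struct.unpack(">Q", data[pos + 8 : pos + 16])
            header = 16
        elif size == 0:  # box extends to the end of the data
            size = len(data) - pos
        if size < header or pos + size > len(data):
            return  # malformed or truncated: terminate instead of looping
        yield btype, pos, pos + size
        pos += size


def strip_boxes(data: bytes, drop: frozenset = frozenset({b"jumb"})) -> tuple[bytes, int]:
    """Remove top-level boxes whose type is in ``drop``; keep everything else."""
    kept = bytearray()
    stripped = 0
    parsed_end = 0
    for btype, start, end in iter_boxes(data):
        if btype in drop:
            stripped += 1
        else:
            kept += data[start:end]
        parsed_end = end
    kept += data[parsed_end:]  # preserve any bytes the walker could not parse
    return bytes(kept), stripped


FTYP = b"\x00\x00\x00\x18ftypavif\x00\x00\x00\x00avifmif1"  # 24-byte ftyp box
jumb = struct.pack(">I", 8 + 5) + b"jumb" + b"hello"
free64 = b"\x00\x00\x00\x01" + b"free" + struct.pack(">Q", 16 + 8) + b"\x00" * 8
```

Appending the unparsed tail rather than discarding it is a design choice in this sketch: a malformed trailing box leaves the file byte-identical instead of silently truncating it.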
+18
@@ -87,6 +87,24 @@ class TestModelProfiles:
    def test_recommended_strength_medium(self):
        assert get_recommended_strength("unknown_type") == 0.35

    @pytest.mark.parametrize("wm_type", ["stegastamp", "stegasamp", "treering", "ringid"])
    def test_high_perturbation_watermark_types(self, wm_type):
        """Robust spatial watermarks need aggressive (0.7) regeneration."""
        assert get_recommended_strength(wm_type) == 0.7

    @pytest.mark.parametrize("wm_type", ["stablesignature", "dwtectsvd", "rivagan", "ssl", "hidden"])
    def test_low_perturbation_watermark_types(self, wm_type):
        """Fragile frequency/latent watermarks break at low (0.04) strength."""
        assert get_recommended_strength(wm_type) == 0.04

    def test_strength_match_is_case_insensitive(self):
        assert get_recommended_strength("TreeRing") == 0.7
        assert get_recommended_strength("StableSignature") == 0.04

    def test_strength_matches_substring_in_descriptive_name(self):
        # e.g. a CLI passing "treering_v2" or "synthid-stegastamp" still maps.
        assert get_recommended_strength("treering_v2") == 0.7
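The tests above describe a three-tier lookup that is case-insensitive and matches on substrings. One way such a lookup might be written is sketched below. The tier values (0.7, 0.04, 0.35) come from the assertions, but `recommended_strength`, the tuple names, and the choice of which names sit in each tuple are illustrative assumptions, not the body of `get_recommended_strength`.

```python
# Assumed, abbreviated tier tables; the real function may list more names.
HIGH_PERTURBATION = ("stegastamp", "treering", "ringid")
LOW_PERTURBATION = ("stablesignature", "rivagan")


def recommended_strength(wm_type: str, default: float = 0.35) -> float:
    """Map a watermark type name to a regeneration strength tier."""
    key = wm_type.lower()  # case-insensitive match
    if any(name in key for name in HIGH_PERTURBATION):  # substring match
        return 0.7
    if any(name in key for name in LOW_PERTURBATION):
        return 0.04
    return default  # unknown types fall back to the medium tier
```

Checking the high tier first means a descriptive name that mentions both kinds of watermark resolves to the more aggressive setting in this sketch.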
# ── Format utilities ────────────────────────────────────────────────
+66
@@ -0,0 +1,66 @@
"""Tests for the SynthID corpus ingestion script."""
from __future__ import annotations
import csv
import sys
from pathlib import Path
import pytest
from click.testing import CliRunner
# scripts/ is not an installed package; add it to the path for import.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "scripts"))
import synthid_corpus
SAMPLES_DIR = Path(__file__).resolve().parent.parent / "data" / "samples"
def _manifest_rows(root: Path) -> list[dict[str, str]]:
    with open(root / "manifest.csv", newline="") as f:
        return list(csv.DictReader(f))


@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
class TestIngest:
    def test_ingest_openai_flags_synthid_metadata(self, tmp_path: Path):
        runner = CliRunner()
        result = runner.invoke(
            synthid_corpus.cli,
            ["ingest", str(SAMPLES_DIR / "chatgpt-1.png"), "--label", "pos", "--root", str(tmp_path)],
        )
        assert result.exit_code == 0, result.output
        rows = _manifest_rows(tmp_path)
        assert len(rows) == 1
        row = rows[0]
        assert row["label"] == "pos"
        assert row["synthid_metadata"] == "yes"
        assert int(row["width"]) > 0
        assert int(row["height"]) > 0
        # The copied file lands under images/pos/ with a sha-prefixed name.
        assert (tmp_path / "images" / "pos" / row["filename"]).exists()

    def test_ingest_firefly_not_flagged(self, tmp_path: Path):
        runner = CliRunner()
        runner.invoke(
            synthid_corpus.cli,
            ["ingest", str(SAMPLES_DIR / "firefly-1.png"), "--label", "neg", "--root", str(tmp_path)],
        )
        rows = _manifest_rows(tmp_path)
        assert len(rows) == 1
        assert rows[0]["synthid_metadata"] == ""  # Adobe signs C2PA but not SynthID

    def test_ingest_dedupes_by_sha256(self, tmp_path: Path):
        runner = CliRunner()
        args = ["ingest", str(SAMPLES_DIR / "chatgpt-1.png"), "--label", "pos", "--root", str(tmp_path)]
        runner.invoke(synthid_corpus.cli, args)
        runner.invoke(synthid_corpus.cli, args)  # second time: duplicate
        assert len(_manifest_rows(tmp_path)) == 1

    def test_status_on_empty_corpus(self, tmp_path: Path):
        runner = CliRunner()
        result = runner.invoke(synthid_corpus.cli, ["status", "--root", str(tmp_path)])
        assert result.exit_code == 0
        assert "empty" in result.output.lower()
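The dedupe test above expects a second ingest of the same file to leave the manifest at one row, with the copy stored under `images/<label>/` using a sha-prefixed name. A minimal sketch of that behavior, assuming a reduced three-column manifest, is given below. The `ingest` function, the `FIELDS` list, and the 12-character prefix are hypothetical; the actual `scripts/synthid_corpus.py` is not shown in this section and records more columns.

```python
import csv
import hashlib
import shutil
import tempfile
from pathlib import Path

FIELDS = ["sha256", "label", "filename"]  # assumed subset of the real columns


def ingest(src: Path, label: str, root: Path) -> bool:
    """Copy ``src`` into the corpus; return False if its content is already there."""
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    manifest = root / "manifest.csv"
    rows: list[dict[str, str]] = []
    if manifest.exists():
        with open(manifest, newline="") as f:
            rows = list(csv.DictReader(f))
    if any(row["sha256"] == digest for row in rows):
        return False  # same bytes already ingested: skip copy and manifest row
    dest_dir = root / "images" / label
    dest_dir.mkdir(parents=True, exist_ok=True)
    filename = f"{digest[:12]}_{src.name}"  # sha-prefixed name avoids collisions
    shutil.copy2(src, dest_dir / filename)
    rows.append({"sha256": digest, "label": label, "filename": filename})
    with open(manifest, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return True


# Usage against a throwaway directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.png"
    sample.write_bytes(b"placeholder image bytes")
    corpus_root = Path(tmp) / "corpus"
    first = ingest(sample, "pos", corpus_root)
    second = ingest(sample, "pos", corpus_root)
    with open(corpus_root / "manifest.csv", newline="") as f:
        manifest_rows = list(csv.DictReader(f))
```

Keying on the content hash rather than the file name means a renamed copy of the same image is still treated as a duplicate.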