feat(metadata): SynthID-source detection, C2PA parser consolidation, corpus + tests

Detect SynthID-bearing images via their C2PA companion: a manifest signed by a SynthID-using vendor (Google/OpenAI) on AI-generated content implies an invisible SynthID pixel watermark. Verified end-to-end against the vendor oracles (openai.com/verify, Gemini "Verify with SynthID").

- metadata: synthid_source() + synthid_watermark verdict in get_ai_metadata, surfaced as a `metadata --check` callout. Format-agnostic (PNG caBX parser + JPEG/WebP/AVIF/HEIF/JXL binary scan).
- constants: SYNTHID_C2PA_ISSUERS {Google, OpenAI}; +opened/placed actions.
- c2pa: single CBOR-aware parser (_cbor_text_after) replaces glitchy regex (fixes fGPT-4o claim_generator); removed duplicate _scan_png_c2pa_chunk from metadata; shared synthid_verdict / synthid_vendors_in helpers.
- corpus: scripts/synthid_corpus.py ingest tool + data/synthid_corpus/ (manifest tracked, images gitignored) for a labeled reference set.
- tests: +38 across C2PA parser internals, extract/inject round-trip, ISOBMFF container stripping, all IPTC AI markers, and invisible watermark strength tiers (SynthID/StableSignature/TreeRing/StegaStamp/RingID/RivaGAN/...).

Pixel-level SynthID detection remains out of reach locally (Google's decoder is proprietary); a from-scratch spectral pilot confirmed it does not separate real content. See CLAUDE.md for the full evaluation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-07-10 10:18:36 +02:00 · 2026-05-24 11:32:46 -07:00
parent c1ff4e1cd9
commit f07ce10c72
14 changed files with 785 additions and 107 deletions
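The `fGPT-4o` artifact named in the commit message follows from how CBOR encodes short text strings: the header byte for an n-byte string (n < 24) is `0x60 + n`, so the 6-byte value `GPT-4o` is preceded by `0x66`, which is ASCII `f`. Below is a minimal, self-contained sketch of that encoding and of a length-aware read. The helper names `cbor_text` and `read_text_after` are illustrative and not part of the repository; the read logic only mirrors the three length forms that the `_cbor_text_after` docstring in this commit lists.

```python
from typing import Optional


def cbor_text(value: str) -> bytes:
    """Encode ``value`` as a CBOR major-type-3 text string (lengths < 65536)."""
    raw = value.encode("utf-8")
    n = len(raw)
    if n < 24:
        header = bytes([0x60 + n])  # length packed into the header byte
    elif n < 0x100:
        header = bytes([0x78, n])  # 1-byte length follows
    elif n < 0x10000:
        header = bytes([0x79, n >> 8, n & 0xFF])  # 2-byte big-endian length
    else:
        raise ValueError("string too long for this sketch")
    return header + raw


def read_text_after(payload: bytes, key: bytes) -> Optional[str]:
    """Decode the CBOR text string that immediately follows ``key``."""
    idx = payload.find(key)
    if idx < 0:
        return None
    p = idx + len(key)
    if p >= len(payload):
        return None
    head = payload[p]
    if 0x60 <= head <= 0x77:
        length, start = head - 0x60, p + 1
    elif head == 0x78 and p + 1 < len(payload):
        length, start = payload[p + 1], p + 2
    elif head == 0x79 and p + 2 < len(payload):
        length, start = (payload[p + 1] << 8) | payload[p + 2], p + 3
    else:
        return None
    return payload[start : start + length].decode("utf-8", errors="replace")


# A key/value pair as it would sit inside a CBOR map.
payload = cbor_text("name") + cbor_text("GPT-4o")
# Naive approach: take every byte after the key, so the header byte leaks in.
naive = payload[payload.find(b"name") + len(b"name"):].decode("ascii")
# Length-aware approach: skip the header and read exactly the declared length.
parsed = read_text_after(payload, b"name")
```

Here `naive` carries the stray leading `f` while `parsed` is the clean value. The same effect appears to explain the `dname` literal in the removed `softwareAgent` regex: `0x64` (ASCII `d`) is the header of the 4-byte key `name`.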
@@ -21,6 +21,9 @@ Thumbs.db
 # Test results
 data/results/

+# SynthID corpus images (manifest.csv + README.md stay tracked)
+data/synthid_corpus/images/
+
 # Reference materials
 _refs/

@@ -30,3 +33,4 @@ yolov8n.pt

 # Claude Code local settings
 .claude/settings.local.json
+data/synthid_corpus/refs/
@@ -11,6 +11,10 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r
 ## Test and lint

 - `bash maintain.sh` — uv-outdated, uv-secure, ruff check/fix, ruff format, pyright, pytest -n auto
+- `maintain.sh` does not currently finish green (pre-existing, not per-change): `uv-secure` aborts on a fixable transitive `idna` vuln, and strict pyright carries debt in `remove_ai_metadata` / `cli.py` (untyped piexif/PIL/click/rich). To gate a change, run `uv run ruff check`, `uv run pyright <changed files>`, `uv run pytest` directly.
+- Run `uv run` from the repo root — from another cwd it falls back to a bare env without numpy/cv2/torch.
+- Metadata/C2PA tests assert against real committed fixtures in `data/samples/` (`chatgpt-*.png` = OpenAI C2PA, `firefly-1.png` = Adobe, `not-ai-*` = clean); synthetic byte blobs cover the JPEG/ISOBMFF format paths.
+- SynthID reference corpus: `scripts/synthid_corpus.py` ingests labeled images into `data/synthid_corpus/` (`manifest.csv` tracked, `images/` gitignored); see its README for the collection protocol and verification oracles.

 ## Configuration

@@ -21,14 +25,14 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r

 - `noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor.
 - `noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, C2PA_ISSUERS, and `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI). Add a new issuer here, not inline.
- `metadata.py` — `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None; `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout.
+- `metadata.py` — `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout.
 - `face_protector.py` — YOLO detect + soft-blend pattern; mirror this for any "protect region during diffusion" features

 ## Known limitations

 - `invisible` pipeline downscales to model-native resolution (1024 px for SDXL) before diffusion. Degrades fine text in infographics. Tracked; fix is tile-based diffusion.
- Pyright first run is slow (2-3 min) due to ML deps (torch/diffusers/transformers stubs)
+- Pyright first run is slow (2-3 min) due to ML deps (torch/diffusers/transformers stubs); full-project `uv run pyright` can stall for many minutes — scope it to changed files.
 - `ultralytics` monkey-patches `PIL.Image.open` and tries to autoload `pi_heif`. When `pi_heif` is missing, opening files raises `ModuleNotFoundError`, not `UnidentifiedImageError`. Code that opens user-supplied or unknown-format files should `except Exception`, not just `OSError`/`UnidentifiedImageError`.
 - Metadata detection for AVIF/HEIF/JPEG-XL relies on a binary scan for `C2PA_UUID` + `IPTC_AI_MARKERS`. C2PA removal in those containers is implemented via `noai/isobmff.py` (top-level ``uuid`` / ``jumb`` box stripper, no re-encoding). EXIF/XMP boxes inside those containers are not yet scrubbed.
- **SynthID detection is metadata-only.** There is no reliable *local* detector of the SynthID *pixel* watermark — Google's decoder is proprietary, no public spec or API (only a waitlisted portal). We detect SynthID by its C2PA companion (`synthid_source` / `SYNTHID_C2PA_ISSUERS`), which is reliable while the manifest is intact but says nothing once C2PA is stripped. Google→SynthID is long-standing; OpenAI→SynthID is confirmed by OpenAI's Help Center (ChatGPT/Codex/API "include both C2PA metadata and SynthID watermarks", updated 2026-05-21) but time-gated (pre-rollout OpenAI images carry C2PA without SynthID), so the OpenAI verdict is hedged "likely". Oracles: Gemini app "Verify with SynthID" (Google), openai.com/verify (OpenAI). The spectral phase-coherence approach from `github.com/aloshdenny/reverse-SynthID` was evaluated (May 2026) and **does not work for real-content detection**: on its own shipped codebook + validation set, watermarked and cleaned images were indistinguishable (conf within noise, cleaned often higher); it only fires on pure-black 1024x1024 reference images at exact resolution (the controlled case it was calibrated on). The README's "90% / conf=0.91" reproduces only in that lab condition. Do not build a production detector on it; if revisited, it is experimental/diagnostic only and needs a per-resolution, per-model reference corpus.
- **SynthID v2 vs default pipeline:** the SDXL-based default profile (since May 2026) defeats SynthID v2. **Verified end-to-end (May 2026):** local SDXL run on a Gemini 3 Pro output, checked via the Gemini app's "Verify with SynthID" feature, returned "no SynthID watermark detected". The same configuration is used in raiw-app production (`fal-ai/fast-sdxl` at native ~1024 px, strength 0.05, steps 50). SD-1.5 dreamshaper at 768 px was previously the default and does NOT defeat v2 — verified empirically against the same feature (strength 0.04, 0.10, and elastic warp α∈{5,8} all flagged positive). That SD-1.5 path was removed; only `default` (SDXL) and `ctrlregen` profiles remain.
+- **SynthID detection is metadata-only.** There is no reliable *local* detector of the SynthID *pixel* watermark — Google's decoder is proprietary, no public spec or API (only a waitlisted portal). We detect SynthID by its C2PA companion (`synthid_source` / `SYNTHID_C2PA_ISSUERS`), which is reliable while the manifest is intact but says nothing once C2PA is stripped. Google→SynthID is long-standing; OpenAI→SynthID is confirmed by OpenAI's Help Center (ChatGPT/Codex/API "include both C2PA metadata and SynthID watermarks", updated 2026-05-21) but time-gated (pre-rollout OpenAI images carry C2PA without SynthID), so the OpenAI verdict is hedged "likely". Oracles: Gemini app "Verify with SynthID" (Google), openai.com/verify (OpenAI). The spectral phase-coherence approach from `github.com/aloshdenny/reverse-SynthID` was evaluated (May 2026) and **does not work for real-content detection**: on its own shipped codebook + validation set, watermarked and cleaned images were indistinguishable (conf within noise, cleaned often higher); it only fires on pure-black 1024x1024 reference images at exact resolution (the controlled case it was calibrated on). The README's "90% / conf=0.91" reproduces only in that lab condition. Do not build a production detector on it; if revisited, it is experimental/diagnostic only and needs a per-resolution, per-model reference corpus. A from-scratch gpt-image pilot (2026-05-24) confirmed this independently: 5 independent solid-black gpt-image outputs share a near-identical fixed signature (pairwise residual correlation **0.92**, avg-template retains 97% energy), so the watermark/carrier IS strongly present and consistent on flat content — but the carrier frequencies extracted from it do NOT discriminate real content (carrier-to-random ratio: cleaned 1.86 > watermarked 1.53; a non-gpt-image image scored highest at 3.67). The signature drowns in content texture. Net: a perfectly consistent solid-color signature still yields no real-content pixel detector with magnitude/carrier methods.
+- **SynthID v2 vs default pipeline:** the SDXL-based default profile (since May 2026) defeats SynthID v2. **Verified end-to-end (May 2026):** local SDXL run on a Gemini 3 Pro output, checked via the Gemini app's "Verify with SynthID" feature, returned "no SynthID watermark detected". Also confirmed against **OpenAI's** SynthID (2026-05-23): a fresh ChatGPT/gpt-image output read "SynthID detected" on openai.com/verify before the local SDXL run and "SynthID not detected" after (corpus regression chain: pos `4ef377bd` -> cleaned `47188e88`). The same configuration is used in raiw-app production (`fal-ai/fast-sdxl` at native ~1024 px, strength 0.05, steps 50). SD-1.5 dreamshaper at 768 px was previously the default and does NOT defeat v2 — verified empirically against the same feature (strength 0.04, 0.10, and elastic warp α∈{5,8} all flagged positive). That SD-1.5 path was removed; only `default` (SDXL) and `ctrlregen` profiles remain.
@@ -187,6 +187,10 @@ remove-ai-watermarks visible image.png -o clean.png
 remove-ai-watermarks invisible image.png -o clean.png --humanize 4.0

 # Check / strip AI metadata (C2PA, EXIF, "Made with AI" labels)
+# --check also flags SynthID-bearing sources: a C2PA manifest signed by
+# Google or OpenAI implies an invisible SynthID watermark in the pixels
+# (both vendors pair the two). Adobe Firefly / Microsoft sign C2PA without
+# SynthID, so they are reported as C2PA only.
 remove-ai-watermarks metadata image.png --check
 remove-ai-watermarks metadata image.png --remove
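
The `--check` rule described in the comment above can be sketched as a three-signal byte scan: a C2PA marker, an AI-source marker, and an issuer that pairs SynthID with C2PA. This is an illustration only. `SYNTHID_SIGNATURES` and `synthid_vendors` are hypothetical names, the issuer byte strings are assumptions rather than the project's actual `SYNTHID_C2PA_ISSUERS` contents, and the sample blobs are hand-made stand-ins for real manifests.

```python
# Hypothetical issuer signatures; the project keeps its real list in noai/constants.py.
SYNTHID_SIGNATURES = {b"Google": "Google", b"OpenAI": "OpenAI"}


def synthid_vendors(data: bytes) -> list:
    """Return SynthID-pairing vendors implied by the raw bytes, or [] if none."""
    lowered = data.lower()
    has_c2pa = b"c2pa" in lowered
    # Case-insensitive, so "compositeWithTrainedAlgorithmicMedia" also matches.
    ai_source = b"trainedalgorithmicmedia" in lowered
    if not (has_c2pa and ai_source):
        return []
    return sorted(name for sig, name in SYNTHID_SIGNATURES.items() if sig in data)


# Hand-made byte blobs standing in for file contents (not real manifests).
openai_blob = b"..c2pa.claim..OpenAI..digitalsourcetype/trainedAlgorithmicMedia"
adobe_blob = b"..c2pa.claim..Adobe..digitalsourcetype/trainedAlgorithmicMedia"
stripped_blob = b"plain pixels, manifest removed"

openai_result = synthid_vendors(openai_blob)      # SynthID-pairing issuer found
adobe_result = synthid_vendors(adobe_blob)        # C2PA only, no SynthID vendor
stripped_result = synthid_vendors(stripped_blob)  # nothing to go on
```

An empty result for the stripped blob means "no evidence", not "no watermark": once the manifest is removed, this kind of check has nothing left to inspect.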

@@ -253,7 +257,7 @@ pip install certifi

 Tracked but not yet implemented:

- **SynthID-Image v2 automated regression test**. The default SDXL profile defeats v2 per manual checks against the [Gemini app](https://support.google.com/gemini/answer/16722517)'s "Verify with SynthID" feature on a Gemini 3 Pro output (May 2026). An automated end-to-end test would need either programmatic access to the [SynthID Detector portal](https://blog.google/innovation-and-ai/products/google-synthid-ai-content-detector/) (waitlist for media professionals and researchers) or an offline surrogate detector. Open.
+- **SynthID-Image v2 automated regression test**. The default SDXL profile defeats v2 per manual checks against the [Gemini app](https://support.google.com/gemini/answer/16722517)'s "Verify with SynthID" feature on a Gemini 3 Pro output (May 2026). An automated end-to-end test would need either programmatic access to the [SynthID Detector portal](https://blog.google/innovation-and-ai/products/google-synthid-ai-content-detector/) (waitlist for media professionals and researchers) or an offline surrogate detector. The spectral phase-coherence surrogate from [reverse-SynthID](https://github.com/aloshdenny/reverse-SynthID) was evaluated and does not separate watermarked from cleaned real-content images (it only fires on controlled solid-color references at exact resolution), so it is not a usable oracle. Open.
 - **AVIF / HEIF / JPEG-XL detection limits**. Removal strips top-level C2PA `uuid` and JUMBF `jumb` boxes. EXIF/XMP boxes inside these containers are not yet scrubbed (PNG and JPEG are fully covered).
 - **Video pipeline (`noai-video`)**: per-frame inpainting and tracking for Sora 2 dynamic logo, Veo 3.1 badge, Kling, Runway. Separate package, not folded into this repo.

@@ -0,0 +1,75 @@
+# SynthID reference corpus
+
+A locally-collected, labeled image corpus for SynthID work. Two downstream uses:
+
+1. **Per-resolution spectral codebook** for an experimental SynthID detector
+   (carrier frequencies are resolution-dependent, so labels must record the
+   exact native resolution).
+2. **Removal regression set** — verify that our pipeline turns a SynthID-positive
+   image into a negative one.
+
+There is no reliable local detector of the SynthID pixel watermark (Google's
+decoder is proprietary). The ground-truth label therefore comes from an
+external oracle, recorded per image in `verified_via` (see below).
+
+## Layout
+
+```
+data/synthid_corpus/
+  README.md        # this protocol (committed)
+  manifest.csv     # labels + provenance (committed, reviewable)
+  images/          # the actual files (gitignored, local-only)
+    pos/           # SynthID present
+    neg/           # SynthID absent
+    cleaned/       # our pipeline output from a pos image
+```
+
+Images are gitignored on purpose: the corpus is large, may contain personal or
+licensed content, and SynthID-positive outputs are best kept local. The
+`manifest.csv` (sha256 + labels + extracted metadata) is the durable artifact.
+
+## Verification levels (`verified_via`)
+
+Ground-truth quality, strongest first:
+
+- `gemini-app` — checked via the Gemini app "Verify with SynthID" feature. Gold standard for the pixel watermark (Google models).
+- `openai-verify` — checked via openai.com/verify (gold standard for OpenAI ChatGPT/Codex/API images).
+- `synthid-portal` — checked via Google's SynthID Detector portal.
+- `c2pa-metadata` — issuer-only proxy (Google/OpenAI C2PA manifest present). Weaker: the C2PA can be stripped while the pixel watermark remains.
+- `third-party` — label asserted by an external dataset, not independently verified.
+- `none` — unverified.
+
+Prefer `gemini-app` (Google models) or `openai-verify` (OpenAI models) for any image that will train the codebook or gate a test.
+
+## What to collect
+
+For the **codebook** (per target resolution, e.g. 1024x1024, 1024x1536, 1536x2816):
+
+- 30-50+ SynthID-positive outputs per resolution (more is better; ~150-200 per
+  resolution materially improves carrier discovery).
+- At each target resolution, also a batch of **pure-black (#000000)** and
+  **pure-white (#FFFFFF)** fills generated by the SynthID model — these isolate
+  the content-independent carrier (the watermark is most of the signal there).
+
+For the **regression set**:
+
+- A handful of `pos` images, their `cleaned` counterparts (run through our
+  pipeline), and a re-verification of each cleaned image via the oracle that
+  labeled its `pos` original (`gemini-app` or `openai-verify`; should read negative).
+- `neg` controls: non-AI photos and outputs from non-SynthID models (SD,
+  Midjourney, Firefly) verified negative.
+
+Avoid personal or identifiable content; the corpus stays local.
+
+## Ingesting
+
+Use `scripts/synthid_corpus.py` — it copies a file in, records its sha256,
+resolution, format, and C2PA issuer (via our own detector), and appends a row
+to `manifest.csv`:
+
+```bash
+uv run python scripts/synthid_corpus.py ingest path/to/*.png \
+    --label pos --source "Gemini app" --model gemini-3-pro \
+    --verified-via gemini-app --notes "1024x1024 batch"
+
+uv run python scripts/synthid_corpus.py status   # counts by label / resolution / verification
+```
@@ -0,0 +1,4 @@
+sha256,filename,label,source,model,width,height,format,c2pa_issuer,synthid_metadata,verified_via,added,notes
+4ef377bde1a1d4eff141972841938643b173f5052992a018b9a21b31ac31731e,"4ef377bd-ChatGPT Image May 23, 2026, 02_43_02 PM.png",pos,ChatGPT,gpt-image,1254,1254,png,OpenAI,yes,openai-verify,2026-05-23T21:48:12Z,fresh post-rollout 2026-05-23; openai.com/verify: SynthID+C2PA detected
+d09f84c0e4c6d8b336bf4a9a7277314e940dcb5052ae7051e785cbb3bb42d656,d09f84c0-Gemini_Generated_Image_vq7wkwvq7wkwvq7w.png,pos,Gemini app,gemini,2816,1536,png,Google LLC,yes,c2pa-metadata,2026-05-23T21:52:40Z,"user: latest Gemini, SynthID v2"
+47188e88f956291bd38ab6906e5f21eb273d4a697ddc8b4479deac9f48915e1a,47188e88-disco_synthid_removed.png,cleaned,our pipeline (invisible/SDXL),stabilityai/stable-diffusion-xl-base-1.0,1254,1254,png,,,openai-verify,2026-05-23T22:06:54Z,cleaned from 4ef377bd disco; openai.com/verify: SynthID NOT detected (defeated)
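
The three manifest rows above already encode one regression chain: a `pos` image and the `cleaned` output derived from it. A minimal sketch of recovering such chains with the standard `csv` module follows. It assumes the free-text `notes` column uses the phrase `cleaned from <first 8 hex chars of the source sha256>`, which is inferred from the single cleaned row here and is not a documented convention. `regression_chains` is a hypothetical helper, and only a subset of the manifest columns is reproduced.

```python
import csv
import io
import re

# Subset of the manifest columns, copied from the rows above.
MANIFEST = """sha256,label,verified_via,notes
4ef377bde1a1d4eff141972841938643b173f5052992a018b9a21b31ac31731e,pos,openai-verify,fresh post-rollout 2026-05-23; openai.com/verify: SynthID+C2PA detected
d09f84c0e4c6d8b336bf4a9a7277314e940dcb5052ae7051e785cbb3bb42d656,pos,c2pa-metadata,"user: latest Gemini, SynthID v2"
47188e88f956291bd38ab6906e5f21eb273d4a697ddc8b4479deac9f48915e1a,cleaned,openai-verify,cleaned from 4ef377bd disco; openai.com/verify: SynthID NOT detected (defeated)
"""

# Assumed notes convention: "cleaned from <8 hex chars>".
CLEANED_FROM = re.compile(r"cleaned from ([0-9a-f]{8})")


def regression_chains(rows):
    """Pair each cleaned row with the pos row its notes point at."""
    pos = {r["sha256"][:8]: r for r in rows if r["label"] == "pos"}
    chains = []
    for row in rows:
        if row["label"] != "cleaned":
            continue
        match = CLEANED_FROM.search(row["notes"])
        if match and match.group(1) in pos:
            chains.append((match.group(1), row["sha256"][:8]))
    return chains


rows = list(csv.DictReader(io.StringIO(MANIFEST)))
chains = regression_chains(rows)
```

If chains are meant to be machine-readable in the long run, a dedicated column (for example a hypothetical `derived_from` holding the source sha256) would be more robust than parsing `notes`; that is a design suggestion, not something this commit adds.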
@@ -0,0 +1,211 @@
+"""Ingest and inspect the local SynthID reference corpus.
+
+Copies images into ``data/synthid_corpus/images/<label>/`` and records one row
+per image in ``manifest.csv`` (sha256, resolution, format, C2PA issuer, and the
+external verification level). Dogfoods the project's own C2PA detector so the
+recorded metadata matches what the library sees.
+
+See ``data/synthid_corpus/README.md`` for the collection protocol.
+
+Usage:
+    uv run python scripts/synthid_corpus.py ingest IMAGES... --label pos \\
+        --source "Gemini app" --model gemini-3-pro --verified-via gemini-app
+    uv run python scripts/synthid_corpus.py status
+"""
+
+from __future__ import annotations
+
+import csv
+import hashlib
+import logging
+import shutil
+from collections import Counter
+from datetime import datetime, timezone
+from pathlib import Path
+
+import click
+from PIL import Image
+from rich.console import Console
+from rich.table import Table
+
+from remove_ai_watermarks.noai.c2pa import extract_c2pa_info
+
+log = logging.getLogger(__name__)
+console = Console()
+
+DEFAULT_ROOT = Path(__file__).resolve().parent.parent / "data" / "synthid_corpus"
+LABELS = ("pos", "neg", "cleaned")
+VERIFIED_VIA = ("gemini-app", "openai-verify", "synthid-portal", "c2pa-metadata", "third-party", "none")
+FIELDNAMES = [
+    "sha256",
+    "filename",
+    "label",
+    "source",
+    "model",
+    "width",
+    "height",
+    "format",
+    "c2pa_issuer",
+    "synthid_metadata",
+    "verified_via",
+    "added",
+    "notes",
+]
+
+
+def _manifest_path(root: Path) -> Path:
+    return root / "manifest.csv"
+
+
+def _read_manifest(root: Path) -> list[dict[str, str]]:
+    path = _manifest_path(root)
+    if not path.exists():
+        return []
+    with open(path, newline="") as f:
+        return list(csv.DictReader(f))
+
+
+def _sha256(path: Path) -> str:
+    h = hashlib.sha256()
+    with open(path, "rb") as f:
+        for chunk in iter(lambda: f.read(1 << 20), b""):
+            h.update(chunk)
+    return h.hexdigest()
+
+
+def _probe(path: Path) -> tuple[int, int, str, str, bool]:
+    """Return (width, height, format, c2pa_issuer, synthid_metadata)."""
+    width = height = 0
+    fmt = path.suffix.lstrip(".").lower()
+    try:
+        with Image.open(path) as img:
+            width, height = img.size
+            fmt = (img.format or fmt).lower()
+    except Exception as exc:  # unknown/user formats can raise non-OSError; see CLAUDE.md
+        log.debug("PIL could not open %s: %s", path, exc)
+
+    info = extract_c2pa_info(path)
+    issuer = str(info.get("issuer", ""))
+    synthid = "synthid_watermark" in info
+    return width, height, fmt, issuer, synthid
+
+
+@click.group()
+def cli() -> None:
+    """Manage the local SynthID reference corpus."""
+    logging.basicConfig(level=logging.INFO, format="%(message)s")
+
+
+@cli.command()
+@click.argument("images", nargs=-1, required=True, type=click.Path(exists=True, path_type=Path))
+@click.option("--label", required=True, type=click.Choice(LABELS), help="SynthID label.")
+@click.option("--source", default="", help="Where the image came from (free text).")
+@click.option("--model", default="", help="Generating model, e.g. gemini-3-pro.")
+@click.option(
+    "--verified-via", "verified_via", default="none", type=click.Choice(VERIFIED_VIA), help="Ground-truth oracle."
+)
+@click.option("--notes", default="", help="Free-text notes (e.g. resolution batch).")
+@click.option("--root", type=click.Path(path_type=Path), default=DEFAULT_ROOT, help="Corpus root.")
+def ingest(
+    images: tuple[Path, ...],
+    label: str,
+    source: str,
+    model: str,
+    verified_via: str,
+    notes: str,
+    root: Path,
+) -> None:
+    """Copy IMAGES into the corpus and append rows to the manifest."""
+    dest_dir = root / "images" / label
+    dest_dir.mkdir(parents=True, exist_ok=True)
+
+    rows = _read_manifest(root)
+    seen = {r["sha256"] for r in rows}
+    added = 0
+    new_rows: list[dict[str, str]] = []
+
+    for src in images:
+        digest = _sha256(src)
+        if digest in seen:
+            console.print(f"  [dim]skip (duplicate sha256): {src.name}[/]")
+            continue
+        seen.add(digest)
+
+        width, height, fmt, issuer, synthid = _probe(src)
+        stored_name = f"{digest[:8]}-{src.name}"
+        shutil.copy2(src, dest_dir / stored_name)
+
+        new_rows.append(
+            {
+                "sha256": digest,
+                "filename": stored_name,
+                "label": label,
+                "source": source,
+                "model": model,
+                "width": str(width),
+                "height": str(height),
+                "format": fmt,
+                "c2pa_issuer": issuer,
+                "synthid_metadata": "yes" if synthid else "",
+                "verified_via": verified_via,
+                "added": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+                "notes": notes,
+            }
+        )
+        added += 1
+        flag = " [yellow](C2PA-SynthID)[/]" if synthid else ""
+        console.print(f"  [green]+[/] {label}/{stored_name}  {width}x{height} {fmt}{flag}")
+
+    if new_rows:
+        path = _manifest_path(root)
+        write_header = not path.exists()
+        path.parent.mkdir(parents=True, exist_ok=True)
+        with open(path, "a", newline="") as f:
+            writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
+            if write_header:
+                writer.writeheader()
+            writer.writerows(new_rows)
+
+    console.print(f"\n  Ingested [bold]{added}[/] image(s); skipped {len(images) - added} duplicate(s).")
+
+
+@cli.command()
+@click.option("--root", type=click.Path(path_type=Path), default=DEFAULT_ROOT, help="Corpus root.")
+def status(root: Path) -> None:
+    """Print corpus counts by label, resolution, and verification level."""
+    rows = _read_manifest(root)
+    if not rows:
+        console.print("  [dim]Corpus is empty.[/]")
+        return
+
+    by_label = Counter(r["label"] for r in rows)
+    by_verified = Counter(r["verified_via"] for r in rows)
+    by_res = Counter(f"{r['label']}  {r['width']}x{r['height']}" for r in rows)
+
+    console.print(f"\n  [bold]{len(rows)}[/] image(s) in {root}")
+
+    t = Table(title="By label", show_header=True, header_style="bold")
+    t.add_column("Label")
+    t.add_column("Count", justify="right")
+    for k in LABELS:
+        if by_label.get(k):
+            t.add_row(k, str(by_label[k]))
+    console.print(t)
+
+    t = Table(title="By label x resolution", show_header=True, header_style="bold")
+    t.add_column("Label / resolution")
+    t.add_column("Count", justify="right")
+    for k, v in sorted(by_res.items()):
+        t.add_row(k, str(v))
+    console.print(t)
+
+    t = Table(title="By verification", show_header=True, header_style="bold")
+    t.add_column("verified_via")
+    t.add_column("Count", justify="right")
+    for k, v in by_verified.most_common():
+        t.add_row(k, str(v))
+    console.print(t)
+
+
+if __name__ == "__main__":
+    cli()
@@ -310,6 +310,8 @@ def cmd_metadata(
        if has_ai:
            console.print(f"  [yellow]⚠[/] AI metadata detected in {source.name}:")
            meta = get_ai_metadata(source)
+            if synthid := meta.get("synthid_watermark"):
+                console.print(f"  [bold yellow]⚠ SynthID pixel watermark {synthid}[/]")
            table = Table(show_header=True, header_style="bold")
            table.add_column("Key", style="cyan")
            table.add_column("Value")
@@ -142,80 +142,45 @@ def has_ai_metadata(image_path: Path) -> bool:
    return any(marker in data for marker in IPTC_AI_MARKERS)


-def _scan_png_c2pa_chunk(image_path: Path) -> dict[str, str]:
-    """Extract a human-readable summary of the C2PA manifest in a PNG file.
+def synthid_source(image_path: Path) -> str | None:
+    """Return the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark.

-    PIL does not expose the caBX JUMBF box via ``img.info``, so we delegate
-    chunk extraction to the existing ``extract_c2pa_chunk`` helper and pull
-    key fields from the JUMBF payload without a full CBOR parser.
+    This is a *metadata-based* proxy: Google (Imagen/Gemini) and OpenAI
+    (ChatGPT/DALL-E/gpt-image) embed an invisible SynthID watermark alongside
+    a C2PA manifest, so a C2PA manifest signed by one of them on AI-generated
+    content implies SynthID in the pixels. Adobe Firefly / Microsoft Designer
+    sign C2PA but do not use SynthID, so they return None.
+
+    The verdict is reliable only while the C2PA manifest is intact -- absence
+    is not proof, because C2PA can be stripped while the pixel watermark
+    survives, and the pixel watermark itself is not locally detectable
+    (proprietary decoder).
+
+    Args:
+        image_path: Path to the image (PNG, JPEG, WebP, or ISOBMFF container).
+
+    Returns:
+        Comma-joined vendor name(s) (e.g. ``"OpenAI"``) or None.
    """
-    import re
+    from remove_ai_watermarks.noai.c2pa import extract_c2pa_info, synthid_vendors_in

-    from remove_ai_watermarks.noai.c2pa import extract_c2pa_chunk
+    # PNG: the caBX chunk parser gives a clean, structured issuer.
+    vendors = extract_c2pa_info(image_path).get("synthid_vendors")
+    if vendors:
+        return ", ".join(vendors)

-    raw = extract_c2pa_chunk(image_path)
-    if raw is None:
-        return {}
-
-    # extract_c2pa_chunk returns chunk_header (8 bytes) + data + crc (4 bytes).
-    payload = raw[8:-4]
-    result: dict[str, str] = {"c2pa_manifest": f"C2PA manifest ({len(payload)} bytes)"}
-
-    def _cbor_text_after(key: bytes) -> str | None:
-        """Return the CBOR text-string immediately following ``key``.
-
-        Handles CBOR major-type 3 length prefixes: direct (0x60-0x77),
-        1-byte (0x78 NN), and 2-byte (0x79 NN NN).
-        """
-        idx = payload.find(key)
-        if idx < 0:
-            return None
-        p = idx + len(key)
-        if p >= len(payload):
-            return None
-        head = payload[p]
-        if 0x60 <= head <= 0x77:
-            length, start = head - 0x60, p + 1
-        elif head == 0x78 and p + 1 < len(payload):
-            length, start = payload[p + 1], p + 2
-        elif head == 0x79 and p + 2 < len(payload):
-            length, start = (payload[p + 1] << 8) | payload[p + 2], p + 3
-        else:
-            return None
-        raw_str = payload[start : start + length]
-        try:
-            return raw_str.decode("utf-8")
-        except UnicodeDecodeError:
-            return raw_str.decode("latin1", errors="replace")
-
-    if generator := _cbor_text_after(b"name"):
-        result["claim_generator"] = generator
-
-    if spec := _cbor_text_after(b"specVersion"):
-        result["c2pa_spec"] = spec
-
-    dst_match = re.search(
-        rb"(http://cv\.iptc\.org/newscodes/digitalsourcetype/[A-Za-z0-9_-]+)",
-        payload,
-    )
-    if dst_match:
-        result["digital_source_type"] = dst_match.group(1).decode("latin1")
-
-    actions = sorted(
-        {m.decode("latin1") for m in re.findall(rb"c2pa\.(created|converted|edited|opened|placed)", payload)}
-    )
-    if actions:
-        result["c2pa_actions"] = ", ".join(actions)
-
-    # Scan cert DN printable strings for the signer org name.
-    signer_match = re.search(
-        rb"([A-Za-z][A-Za-z0-9 .,&'()\-]{2,48}OpenAI[A-Za-z0-9 .,&'()\-]{0,48})",
-        payload,
-    )
-    if signer_match:
-        result["signer"] = signer_match.group(1).decode("latin1").strip()
-
-    return result
+    # Non-PNG containers (JPEG APP11, WebP, AVIF/HEIF/JXL uuid box) keep the
+    # C2PA manifest where the PNG parser can't reach it. Binary-scan for the
+    # same signal: a C2PA manifest from a SynthID-using issuer on AI content.
+    with open(image_path, "rb") as f:
+        data = f.read(1024 * 1024)
+    has_c2pa = b"c2pa" in data.lower() or C2PA_UUID in data
+    # Matches both "trainedAlgorithmicMedia" and "compositeWithTrainedAlgorithmicMedia".
+    ai_source = b"trainedAlgorithmicMedia" in data or b"TrainedAlgorithmicMedia" in data
+    if not (has_c2pa and ai_source):
+        return None
+    matched = synthid_vendors_in(data)
+    return ", ".join(matched) if matched else None


 def get_ai_metadata(image_path: Path) -> dict[str, str]:
@@ -229,6 +194,8 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]:
    """
    from PIL import Image

+    from remove_ai_watermarks.noai.c2pa import extract_c2pa_info, synthid_verdict
+
    result: dict[str, str] = {}

    with Image.open(image_path) as img:
@@ -241,7 +208,24 @@ def get_ai_metadata(image_path: Path) -> dict[str, str]:
                else:
                    result[key] = str(value)

-    result.update(_scan_png_c2pa_chunk(image_path))
+    # C2PA manifest fields from the single canonical parser (noai/c2pa.py).
+    c2pa = extract_c2pa_info(image_path)
+    for key in (
+        "c2pa_manifest",
+        "claim_generator",
+        "c2pa_spec",
+        "issuer",
+        "source_type",
+        "actions",
+        "synthid_watermark",
+    ):
+        if key in c2pa:
+            result.setdefault(key, str(c2pa[key]))
+
+    # Non-PNG containers (JPEG/WebP/AVIF): extract_c2pa_info is PNG-only, so
+    # fall back to the format-agnostic source check for the SynthID verdict.
+    if "synthid_watermark" not in result and (vendor := synthid_source(image_path)):
+        result.setdefault("synthid_watermark", synthid_verdict(vendor))
    return result


@@ -29,6 +29,7 @@ from remove_ai_watermarks.noai.constants import (
    C2PA_ISSUERS,
    C2PA_SIGNATURES,
    PNG_SIGNATURE,
+    SYNTHID_C2PA_ISSUERS,
 )


@@ -129,10 +130,56 @@ def extract_c2pa_info(image_path: Path) -> dict[str, Any]:
    return c2pa_info


+def _cbor_text_after(payload: bytes, key: bytes) -> str | None:
+    """Return the CBOR text-string immediately following ``key`` in ``payload``.
+
+    Handles CBOR major-type 3 length prefixes: direct (0x60-0x77), 1-byte
+    (0x78 NN), and 2-byte (0x79 NN NN). This reads the actual encoded value, so
+    it avoids the byte-grabbing artifacts a loose regex produces (e.g. the
+    leading length byte showing up as ``fGPT-4o``).
+    """
+    idx = payload.find(key)
+    if idx < 0:
+        return None
+    p = idx + len(key)
+    if p >= len(payload):
+        return None
+    head = payload[p]
+    if 0x60 <= head <= 0x77:
+        length, start = head - 0x60, p + 1
+    elif head == 0x78 and p + 1 < len(payload):
+        length, start = payload[p + 1], p + 2
+    elif head == 0x79 and p + 2 < len(payload):
+        length, start = (payload[p + 1] << 8) | payload[p + 2], p + 3
+    else:
+        return None
+    raw_str = payload[start : start + length]
+    try:
+        return raw_str.decode("utf-8")
+    except UnicodeDecodeError:
+        return raw_str.decode("latin1", errors="replace")
+
+
+def synthid_verdict(vendors: str) -> str:
+    """Human-readable SynthID-source verdict, shared by all callers."""
+    return f"likely present ({vendors} embeds SynthID with C2PA)"
+
+
+def synthid_vendors_in(buffer: bytes) -> list[str]:
+    """Return SynthID-using C2PA issuer names whose signature appears in ``buffer``.
+
+    Shared by the PNG caBX parser and the format-agnostic binary scan so both
+    apply the same SYNTHID_C2PA_ISSUERS rule against their respective bytes.
+    """
+    return sorted({name for sig, name in C2PA_ISSUERS.items() if sig in buffer and sig in SYNTHID_C2PA_ISSUERS})
+
+
 def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
    """Parse C2PA chunk data and populate info dictionary."""
+    c2pa_info["c2pa_manifest"] = f"C2PA manifest ({len(chunk_data)} bytes)"
+
    # Find issuers
-    issuers = []
+    issuers: list[str] = []
    for sig, name in C2PA_ISSUERS.items():
        if sig in chunk_data:
            issuers.append(name)
@@ -140,44 +187,22 @@ def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
        c2pa_info["issuer"] = ", ".join(set(issuers))

    # Find AI tools
-    ai_tools = []
+    ai_tools: list[str] = []
    for sig, name in C2PA_AI_TOOLS.items():
        if sig in chunk_data:
            ai_tools.append(name)
    if ai_tools:
        c2pa_info["ai_tool"] = ", ".join(set(ai_tools))

-    # Extract software agent (multiple patterns)
-    patterns = [
-        rb"softwareAgent.*?dname([^\x00]+?)(?:q|l|m|n)",
-        rb"software_agent[^\x00]*?([A-Za-z0-9_\-\.]+)",
-        rb"Software[^\x00]*?([A-Za-z0-9_\-\. ]+)",
-    ]
-    for pattern in patterns:
-        match = re.search(pattern, chunk_data, re.DOTALL | re.IGNORECASE)
-        if match:
-            agent = match.group(1).decode("utf-8", errors="ignore").strip()
-            if agent and len(agent) < 100:
-                c2pa_info["software_agent"] = agent
-                break
-
-    # Extract claim generator (multiple patterns)
-    claim_patterns = [
-        rb"claim_generator[^\x00]*?([A-Za-z0-9_\-\.\/\:]+)",
-        rb"claimGenerator[^\x00]*?([A-Za-z0-9_\-\.\/\:]+)",
-        rb"dname([^\x00]{3,50})(?:q|l|m|n|i)",
-    ]
-    for pattern in claim_patterns:
-        match = re.search(pattern, chunk_data, re.DOTALL | re.IGNORECASE)
-        if match:
-            gen_name = match.group(1).decode("utf-8", errors="ignore").strip()
-            # Filter out common false positives
-            if gen_name and len(gen_name) < 100 and not gen_name.startswith(("\\x", "\\\\x")):
-                c2pa_info["claim_generator"] = gen_name
-                break
+    # Claim generator and spec version: read the CBOR text-string values
+    # directly (regex byte-grabbing produced artifacts like ``fGPT-4o``).
+    if generator := _cbor_text_after(chunk_data, b"name"):
+        c2pa_info["claim_generator"] = generator
+    if spec := _cbor_text_after(chunk_data, b"specVersion"):
+        c2pa_info["c2pa_spec"] = spec

    # Find actions
-    actions = []
+    actions: list[str] = []
    for sig, name in C2PA_ACTIONS.items():
        if sig in chunk_data:
            actions.append(name)
@@ -192,12 +217,23 @@ def _parse_c2pa_chunk(chunk_data: bytes, c2pa_info: dict[str, Any]) -> None:
            c2pa_info["timestamps"] = [t.decode("utf-8") for t in timestamp_matches[:3]]

    # Find digital source type
+    ai_source = False
    if b"trainedAlgorithmicMedia" in chunk_data:
        c2pa_info["source_type"] = "trainedAlgorithmicMedia (AI-generated)"
+        ai_source = True
    elif b"algorithmicMedia" in chunk_data:
        c2pa_info["source_type"] = "algorithmicMedia"
    elif b"compositeWithTrainedAlgorithmicMedia" in chunk_data:
        c2pa_info["source_type"] = "compositeWithTrainedAlgorithmicMedia (AI-enhanced)"
+        ai_source = True
+
+    # SynthID pixel-watermark proxy: a C2PA manifest from a SynthID-using
+    # vendor (Google/OpenAI) on AI-generated content implies an invisible
+    # SynthID watermark in the pixels (see SYNTHID_C2PA_ISSUERS).
+    synthid_vendors = synthid_vendors_in(chunk_data)
+    if synthid_vendors and ai_source:
+        c2pa_info["synthid_vendors"] = synthid_vendors
+        c2pa_info["synthid_watermark"] = synthid_verdict(", ".join(synthid_vendors))


 def extract_c2pa_chunk(image_path: Path) -> bytes | None:
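
Note on the ``fGPT-4o`` artifact: it follows directly from how CBOR encodes short text strings. A 6-byte string gets the head byte 0x60 + 6 = 0x66, which is ASCII ``f``, so a regex that grabs printable bytes picks up the length prefix as if it were text. The sketch below is illustrative only and assumes the same three length forms that _cbor_text_after handles; encode_cbor_text and decode_cbor_text are hypothetical names, not functions in noai/c2pa.py.

```python
from __future__ import annotations


def encode_cbor_text(text: str) -> bytes:
    """Encode ``text`` as a CBOR major-type-3 string (lengths up to 0xFFFF only)."""
    raw = text.encode("utf-8")
    n = len(raw)
    if n <= 0x17:
        head = bytes([0x60 + n])  # direct length in the head byte
    elif n <= 0xFF:
        head = bytes([0x78, n])  # 1-byte length follows
    elif n <= 0xFFFF:
        head = bytes([0x79]) + n.to_bytes(2, "big")  # 2-byte big-endian length
    else:
        raise ValueError("longer strings need the 4- or 8-byte length forms")
    return head + raw


def decode_cbor_text(buf: bytes, pos: int = 0) -> str | None:
    """Decode the CBOR text string starting at ``pos``, or return None."""
    if pos >= len(buf):
        return None
    head = buf[pos]
    if 0x60 <= head <= 0x77:
        length, start = head - 0x60, pos + 1
    elif head == 0x78 and pos + 1 < len(buf):
        length, start = buf[pos + 1], pos + 2
    elif head == 0x79 and pos + 2 < len(buf):
        length, start = int.from_bytes(buf[pos + 1 : pos + 3], "big"), pos + 3
    else:
        return None  # not a text-string head this sketch understands
    return buf[start : start + length].decode("utf-8", errors="replace")


# The head byte for a 6-character string is 0x66, i.e. ASCII "f".
encoded = encode_cbor_text("GPT-4o")
decoded = decode_cbor_text(encoded)
```

Reading the length prefix and slicing exactly that many bytes is what keeps the head byte out of the decoded value.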
@@ -87,6 +87,27 @@ C2PA_ISSUERS = {
    b"Truepic": "Truepic",
 }

+# C2PA issuers whose signed outputs also carry an invisible SynthID pixel
+# watermark -- a metadata proxy for "SynthID is in the pixels":
+#   - Google (Imagen/Gemini): embeds SynthID, long-standing (DeepMind docs).
+#   - OpenAI (ChatGPT/Codex/API): pairs SynthID with C2PA since ~2026-05-20.
+#     Confirmed by OpenAI's Help Center ("C2PA and SynthID in OpenAI-generated
+#     images", updated 2026-05-21): "Images generated with ChatGPT, Codex, and
+#     our API include both C2PA metadata and SynthID watermarks." OpenAI also
+#     notes a signal may be absent if "the image was created before these
+#     signals were available" -- so OpenAI images from BEFORE the rollout carry
+#     C2PA WITHOUT SynthID (e.g. data/samples/openai-images-2/amur-leopard.png,
+#     C2PA timestamp 2026-04-22). For OpenAI the proxy is therefore "likely",
+#     not certain; the verdict string is hedged accordingly. OpenAI's own oracle
+#     is openai.com/verify (Google's is the Gemini app "Verify with SynthID").
+# The issuer byte string ("OpenAI"/"Google") is verified locally against data/samples;
+# the SynthID pairing is documented behavior (Google: DeepMind; OpenAI: above).
+# Adobe Firefly and Microsoft Designer sign C2PA but do NOT use SynthID, so a
+# C2PA manifest alone is not a SynthID signal -- the issuer is. The pixel
+# watermark is not locally detectable (proprietary decoder); the C2PA companion
+# is the proxy, and only while the manifest is intact.
+SYNTHID_C2PA_ISSUERS = frozenset({b"Google", b"OpenAI"})
+
 # C2PA known AI tools
 C2PA_AI_TOOLS = {
    b"GPT-4o": "GPT-4o",
@@ -106,6 +127,8 @@ C2PA_ACTIONS = {
    b"c2pa.filtered": "filtered",
    b"c2pa.cropped": "cropped",
    b"c2pa.resized": "resized",
+    b"c2pa.opened": "opened",
+    b"c2pa.placed": "placed",
 }

 # PNG signature
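
The rule described in the SYNTHID_C2PA_ISSUERS comment can be sketched as a two-part gate: the buffer must carry an AI digital-source marker, and the issuer must be one that pairs SynthID with C2PA. This is a minimal illustration under stated assumptions, not the project's code: the tables below are hypothetical stand-ins for the real ones in noai/constants.py (the "Adobe" display name in particular is assumed), and synthid_proxy is a made-up function name.

```python
from __future__ import annotations

# Hypothetical stand-ins for C2PA_ISSUERS / SYNTHID_C2PA_ISSUERS.
ISSUERS = {b"Google": "Google LLC", b"OpenAI": "OpenAI", b"Adobe": "Adobe"}
SYNTHID_ISSUERS = frozenset({b"Google", b"OpenAI"})
# Digital-source markers treated as "AI-generated" in this sketch.
AI_MARKERS = (b"trainedAlgorithmicMedia", b"compositeWithTrainedAlgorithmicMedia")


def synthid_proxy(buffer: bytes) -> list[str]:
    """Return SynthID-using issuer names, or [] if the proxy does not apply."""
    # A manifest alone is not a signal: require an AI source marker first.
    if not any(marker in buffer for marker in AI_MARKERS):
        return []
    # Then keep only issuers known to pair SynthID with C2PA.
    return sorted(
        name
        for sig, name in ISSUERS.items()
        if sig in SYNTHID_ISSUERS and sig in buffer
    )


openai_hit = synthid_proxy(b"jumbc2paOpenAI...trainedAlgorithmicMedia")
adobe_miss = synthid_proxy(b"jumbc2paAdobe...trainedAlgorithmicMedia")
```

An Adobe-signed manifest with the same AI marker yields an empty list, which is the point of keying on the issuer rather than on the presence of C2PA.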
@@ -4,6 +4,7 @@ from __future__ import annotations

 from pathlib import Path

+import pytest
 from PIL import Image
 from PIL.PngImagePlugin import PngInfo

@@ -12,8 +13,12 @@ from remove_ai_watermarks.metadata import (
    get_ai_metadata,
    has_ai_metadata,
    remove_ai_metadata,
+    synthid_source,
 )

+# Real, committed C2PA sample images used to ground the SynthID-source tests.
+SAMPLES_DIR = Path(__file__).resolve().parent.parent / "data" / "samples"
+
 # ── Key detection ───────────────────────────────────────────────────


@@ -144,6 +149,109 @@ class TestGetAiMetadata:
        meta = get_ai_metadata(tmp_clean_png)
        assert meta == {}

+    def test_long_value_is_truncated(self, tmp_path: Path):
+        img = Image.new("RGB", (32, 32))
+        pnginfo = PngInfo()
+        pnginfo.add_text("parameters", "x" * 300)
+        path = tmp_path / "long.png"
+        img.save(path, pnginfo=pnginfo)
+        meta = get_ai_metadata(path)
+        assert meta["parameters"].endswith("…")
+        assert len(meta["parameters"]) <= 205
+
+
+@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
+class TestGetAiMetadataRealSample:
+    """get_ai_metadata surfaces the consolidated C2PA fields on real images."""
+
+    def test_openai_sample_fields(self):
+        meta = get_ai_metadata(SAMPLES_DIR / "chatgpt-1.png")
+        assert "claim_generator" in meta
+        assert "OpenAI" in meta["issuer"]
+        assert "OpenAI" in meta["synthid_watermark"]
+        assert "trainedAlgorithmicMedia" in meta["source_type"]
+
+
+@pytest.mark.parametrize(
+    "marker",
+    [
+        b"trainedAlgorithmicMedia",
+        b"compositeSynthetic",
+        b"algorithmicMedia",
+        b"compositeWithTrainedAlgorithmicMedia",
+    ],
+)
+def test_has_ai_metadata_detects_each_iptc_marker(tmp_path: Path, marker: bytes):
+    """Each IPTC digitalSourceType AI marker in XMP triggers detection."""
+    path = tmp_path / "iptc.jpg"
+    path.write_bytes(b"\xff\xd8\xff\xe1<x:xmpmeta>" + marker + b"</x:xmpmeta>\xff\xd9")
+    assert has_ai_metadata(path)
+
+
+# ── SynthID-source detection (metadata proxy) ────────────────────────
+
+
+@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
+class TestSynthIDSource:
+    """SynthID detection via the C2PA companion manifest.
+
+    Google (Imagen/Gemini) and OpenAI (ChatGPT/DALL-E/gpt-image) pair an
+    invisible SynthID pixel watermark with a C2PA manifest. Adobe Firefly and
+    Microsoft Designer sign C2PA Content Credentials but do NOT use SynthID,
+    so the discriminating signal is the C2PA *issuer*, not the mere presence
+    of a manifest. These tests run against real, committed sample images.
+    """
+
+    def test_openai_chatgpt_is_synthid_source(self):
+        assert synthid_source(SAMPLES_DIR / "chatgpt-1.png") == "OpenAI"
+
+    def test_openai_verdict_in_get_ai_metadata(self):
+        meta = get_ai_metadata(SAMPLES_DIR / "chatgpt-1.png")
+        assert "synthid_watermark" in meta
+        assert "OpenAI" in meta["synthid_watermark"]
+
+    def test_adobe_firefly_is_not_synthid_source(self):
+        # Adobe signs C2PA (trainedAlgorithmicMedia) but embeds no SynthID.
+        assert synthid_source(SAMPLES_DIR / "firefly-1.png") is None
+        assert "synthid_watermark" not in get_ai_metadata(SAMPLES_DIR / "firefly-1.png")
+
+    def test_non_ai_image_is_not_synthid_source(self):
+        assert synthid_source(SAMPLES_DIR / "not-ai-1.jpeg") is None
+
+
+class TestSynthIDSourceNonPng:
+    """SynthID-source detection must work beyond PNG.
+
+    ChatGPT/Gemini images saved as JPEG/WebP/AVIF carry their C2PA manifest in
+    a non-PNG container (JPEG APP11, ISOBMFF uuid box), so the PNG caBX parser
+    misses them. These use synthetic byte blobs (real fixtures aren't shipped).
+    """
+
+    def _c2pa_jpeg(self, tmp_path: Path, name: str, issuer: bytes, marker: bytes = b"trainedAlgorithmicMedia") -> Path:
+        path = tmp_path / name
+        # Minimal JPEG shell (SOI + APP1 marker + EOI) around a C2PA-like blob; not a real APP11 segment.
+        blob = b"jumbc2pa" + issuer + b"..." + marker
+        path.write_bytes(b"\xff\xd8\xff\xe1" + blob + b"\xff\xd9")
+        return path
+
+    def test_openai_c2pa_in_jpeg(self, tmp_path: Path):
+        path = self._c2pa_jpeg(tmp_path, "chatgpt.jpg", b"OpenAI")
+        assert synthid_source(path) == "OpenAI"
+
+    def test_google_c2pa_in_jpeg(self, tmp_path: Path):
+        path = self._c2pa_jpeg(tmp_path, "gemini.jpg", b"Google")
+        assert synthid_source(path) == "Google LLC"
+
+    def test_adobe_c2pa_in_jpeg_is_none(self, tmp_path: Path):
+        # Adobe signs C2PA but embeds no SynthID.
+        path = self._c2pa_jpeg(tmp_path, "firefly.jpg", b"Adobe")
+        assert synthid_source(path) is None
+
+    def test_openai_without_ai_marker_is_none(self, tmp_path: Path):
+        # Issuer present but no AI digital-source marker -> not a SynthID source.
+        path = self._c2pa_jpeg(tmp_path, "edited.jpg", b"OpenAI", marker=b"")
+        assert synthid_source(path) is None
+

 # ── remove_ai_metadata ──────────────────────────────────────────────

@@ -2,10 +2,18 @@

 from __future__ import annotations

+import struct
+from pathlib import Path
+
+import pytest
+
 from remove_ai_watermarks.noai.c2pa import (
+    _cbor_text_after,
    extract_c2pa_chunk,
    extract_c2pa_info,
    has_c2pa_metadata,
+    inject_c2pa_chunk,
+    synthid_verdict,
 )
 from remove_ai_watermarks.noai.cleaner import (
    has_ai_content,
@@ -26,6 +34,10 @@ from remove_ai_watermarks.noai.extractor import (
    get_ai_metadata_summary,
    has_ai_metadata,
 )
+from remove_ai_watermarks.noai.isobmff import (
+    is_isobmff,
+    strip_c2pa_boxes,
+)

 # ── Constants ───────────────────────────────────────────────────────

@@ -128,3 +140,130 @@ class TestC2PA:

    def test_c2pa_returns_false_for_non_png(self, tmp_jpeg_path):
        assert not has_c2pa_metadata(tmp_jpeg_path)
+
+
+SAMPLES_DIR = Path(__file__).resolve().parent.parent / "data" / "samples"
+
+
+@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
+class TestC2PARealSamples:
+    """Parser behavior on real committed C2PA images."""
+
+    def test_detects_c2pa_in_openai_png(self):
+        assert has_c2pa_metadata(SAMPLES_DIR / "chatgpt-1.png")
+
+    def test_extract_info_openai_fields(self):
+        info = extract_c2pa_info(SAMPLES_DIR / "chatgpt-1.png")
+        assert info["has_c2pa"] is True
+        assert "OpenAI" in info["issuer"]
+        assert "c2pa_manifest" in info  # "C2PA manifest (N bytes)"
+        assert "trainedAlgorithmicMedia" in info["source_type"]
+        # CBOR-clean claim generator, no regex artifacts (e.g. "fGPT-4o").
+        assert info["claim_generator"]
+        assert not info["claim_generator"].startswith("f")
+        assert "synthid_watermark" in info
+
+    def test_extract_info_adobe_has_no_synthid(self):
+        info = extract_c2pa_info(SAMPLES_DIR / "firefly-1.png")
+        assert "Adobe" in info["issuer"]
+        assert "synthid_watermark" not in info
+
+    def test_extract_chunk_returns_bytes(self):
+        chunk = extract_c2pa_chunk(SAMPLES_DIR / "chatgpt-1.png")
+        assert chunk is not None
+        assert chunk[4:8] == b"caBX"  # chunk type in the 8-byte header
+
+    def test_inject_round_trip(self, tmp_clean_png, tmp_path):
+        """Extract a real C2PA chunk, inject into a clean PNG, re-detect."""
+        chunk = extract_c2pa_chunk(SAMPLES_DIR / "chatgpt-1.png")
+        out = tmp_path / "injected.png"
+        inject_c2pa_chunk(tmp_clean_png, out, chunk)
+        assert has_c2pa_metadata(out)
+        assert "OpenAI" in extract_c2pa_info(out)["issuer"]
+
+
+class TestC2PAInjectValidation:
+    def test_inject_rejects_non_png(self, tmp_path):
+        with pytest.raises(ValueError, match="only supported for PNG"):
+            inject_c2pa_chunk(tmp_path / "in.jpg", tmp_path / "out.png", b"")
+
+
+# ── CBOR text extraction (parser internals) ─────────────────────────
+
+
+class TestCborTextAfter:
+    """_cbor_text_after handles the three CBOR text-string length prefixes."""
+
+    def test_direct_length(self):
+        # major-type 3, direct length (0x60 + len). "abc" -> 0x63.
+        payload = b"name" + bytes([0x63]) + b"abc"
+        assert _cbor_text_after(payload, b"name") == "abc"
+
+    def test_one_byte_length(self):
+        s = b"x" * 30
+        payload = b"name" + bytes([0x78, 30]) + s
+        assert _cbor_text_after(payload, b"name") == "x" * 30
+
+    def test_two_byte_length(self):
+        s = b"y" * 300
+        payload = b"name" + bytes([0x79]) + struct.pack(">H", 300) + s
+        assert _cbor_text_after(payload, b"name") == "y" * 300
+
+    def test_key_not_found_returns_none(self):
+        assert _cbor_text_after(b"nothing here", b"name") is None
+
+    def test_key_at_end_returns_none(self):
+        assert _cbor_text_after(b"prefixname", b"name") is None
+
+    def test_invalid_head_returns_none(self):
+        # 0x00 is not a text-string head.
+        assert _cbor_text_after(b"name" + bytes([0x00]) + b"abc", b"name") is None
+
+    def test_latin1_fallback_on_invalid_utf8(self):
+        payload = b"name" + bytes([0x61]) + b"\xff"  # len 1, invalid utf-8
+        assert _cbor_text_after(payload, b"name") is not None
+
+
+class TestSynthIDVerdict:
+    def test_format(self):
+        assert synthid_verdict("OpenAI") == "likely present (OpenAI embeds SynthID with C2PA)"
+
+    def test_multiple_vendors(self):
+        assert "Google LLC, OpenAI" in synthid_verdict("Google LLC, OpenAI")
+
+
+# ── ISOBMFF (AVIF / HEIF / JPEG-XL container stripping) ──────────────
+
+FTYP = b"\x00\x00\x00\x18ftypavif\x00\x00\x00\x00avifmif1"  # 24-byte ftyp box
+
+
+class TestISOBMFF:
+    def test_is_isobmff_true(self):
+        assert is_isobmff(FTYP)
+
+    def test_is_isobmff_false_for_png(self):
+        assert not is_isobmff(b"\x89PNG\r\n\x1a\n\x00\x00")
+
+    def test_is_isobmff_false_for_short(self):
+        assert not is_isobmff(b"abc")
+
+    def test_strips_jpegxl_jumb_box(self):
+        """JPEG-XL stores JUMBF in a ``jumb`` box, always stripped."""
+        jumb = struct.pack(">I", 8 + 5) + b"jumb" + b"hello"
+        cleaned, stripped = strip_c2pa_boxes(FTYP + jumb)
+        assert stripped == 1
+        assert cleaned == FTYP
+
+    def test_keeps_non_c2pa_box_with_64bit_size(self):
+        """size==1 means a 64-bit largesize follows; non-C2PA box is kept."""
+        payload = b"\x00" * 8
+        box = b"\x00\x00\x00\x01" + b"free" + struct.pack(">Q", 16 + len(payload)) + payload
+        cleaned, stripped = strip_c2pa_boxes(FTYP + box)
+        assert stripped == 0
+        assert cleaned == FTYP + box
+
+    def test_malformed_box_does_not_crash(self):
+        # A box claiming size 4 (< 8-byte header) must terminate iteration safely.
+        cleaned, stripped = strip_c2pa_boxes(FTYP + b"\x00\x00\x00\x04XXXX")
+        assert stripped == 0
+        assert cleaned.startswith(FTYP)
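
The box-size cases exercised by TestISOBMFF (32-bit size, size==1 with a 64-bit largesize, and a malformed size smaller than the header) can be sketched with a small top-level box walker. This is an assumption-laden illustration, not the implementation of strip_c2pa_boxes: iter_boxes is a hypothetical helper, and it simply stops at the first box whose declared size cannot be valid.

```python
from __future__ import annotations

import struct
from collections.abc import Iterator


def iter_boxes(data: bytes) -> Iterator[tuple[bytes, int, int]]:
    """Yield (box_type, offset, size) for each top-level ISOBMFF box."""
    pos = 0
    while pos + 8 <= len(data):
        # Every box starts with a 32-bit big-endian size and a 4-byte type.
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # size == 1: the real size is a 64-bit largesize after the type.
            if pos + 16 > len(data):
                return
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:
            # size == 0: the box extends to the end of the data.
            size = len(data) - pos
        if size < header or pos + size > len(data):
            return  # malformed or truncated box: stop instead of looping
        yield box_type, pos, size
        pos += size


FTYP = b"\x00\x00\x00\x18ftypavif\x00\x00\x00\x00avifmif1"  # 24-byte ftyp box
jumb = struct.pack(">I", 8 + 5) + b"jumb" + b"hello"
boxes = list(iter_boxes(FTYP + jumb))
```

Rejecting any size below the header length guarantees that ``pos`` always advances, so a corrupt file cannot cause an infinite loop.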
@@ -87,6 +87,24 @@ class TestModelProfiles:
    def test_recommended_strength_medium(self):
        assert get_recommended_strength("unknown_type") == 0.35

+    @pytest.mark.parametrize("wm_type", ["stegastamp", "stegasamp", "treering", "ringid"])
+    def test_high_perturbation_watermark_types(self, wm_type):
+        """Robust spatial watermarks need aggressive (0.7) regeneration."""
+        assert get_recommended_strength(wm_type) == 0.7
+
+    @pytest.mark.parametrize("wm_type", ["stablesignature", "dwtectsvd", "rivagan", "ssl", "hidden"])
+    def test_low_perturbation_watermark_types(self, wm_type):
+        """Fragile frequency/latent watermarks break at low (0.04) strength."""
+        assert get_recommended_strength(wm_type) == 0.04
+
+    def test_strength_match_is_case_insensitive(self):
+        assert get_recommended_strength("TreeRing") == 0.7
+        assert get_recommended_strength("StableSignature") == 0.04
+
+    def test_strength_matches_substring_in_descriptive_name(self):
+        # e.g. a CLI passing "treering_v2" or "synthid-stegastamp" still maps.
+        assert get_recommended_strength("treering_v2") == 0.7
+

 # ── Format utilities ────────────────────────────────────────────────

@@ -0,0 +1,66 @@
+"""Tests for the SynthID corpus ingestion script."""
+
+from __future__ import annotations
+
+import csv
+import sys
+from pathlib import Path
+
+import pytest
+from click.testing import CliRunner
+
+# scripts/ is not an installed package; add it to the path for import.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "scripts"))
+
+import synthid_corpus
+
+SAMPLES_DIR = Path(__file__).resolve().parent.parent / "data" / "samples"
+
+
+def _manifest_rows(root: Path) -> list[dict[str, str]]:
+    with open(root / "manifest.csv", newline="") as f:
+        return list(csv.DictReader(f))
+
+
+@pytest.mark.skipif(not SAMPLES_DIR.exists(), reason="data/samples not present")
+class TestIngest:
+    def test_ingest_openai_flags_synthid_metadata(self, tmp_path: Path):
+        runner = CliRunner()
+        result = runner.invoke(
+            synthid_corpus.cli,
+            ["ingest", str(SAMPLES_DIR / "chatgpt-1.png"), "--label", "pos", "--root", str(tmp_path)],
+        )
+        assert result.exit_code == 0, result.output
+
+        rows = _manifest_rows(tmp_path)
+        assert len(rows) == 1
+        row = rows[0]
+        assert row["label"] == "pos"
+        assert row["synthid_metadata"] == "yes"
+        assert int(row["width"]) > 0
+        assert int(row["height"]) > 0
+        # The copied file lands under images/pos/ with a sha-prefixed name.
+        assert (tmp_path / "images" / "pos" / row["filename"]).exists()
+
+    def test_ingest_firefly_not_flagged(self, tmp_path: Path):
+        runner = CliRunner()
+        runner.invoke(
+            synthid_corpus.cli,
+            ["ingest", str(SAMPLES_DIR / "firefly-1.png"), "--label", "neg", "--root", str(tmp_path)],
+        )
+        rows = _manifest_rows(tmp_path)
+        assert len(rows) == 1
+        assert rows[0]["synthid_metadata"] == ""  # Adobe signs C2PA but embeds no SynthID
+
+    def test_ingest_dedupes_by_sha256(self, tmp_path: Path):
+        runner = CliRunner()
+        args = ["ingest", str(SAMPLES_DIR / "chatgpt-1.png"), "--label", "pos", "--root", str(tmp_path)]
+        runner.invoke(synthid_corpus.cli, args)
+        runner.invoke(synthid_corpus.cli, args)  # second time: duplicate
+        assert len(_manifest_rows(tmp_path)) == 1
+
+    def test_status_on_empty_corpus(self, tmp_path: Path):
+        runner = CliRunner()
+        result = runner.invoke(synthid_corpus.cli, ["status", "--root", str(tmp_path)])
+        assert result.exit_code == 0
+        assert "empty" in result.output.lower()