remove-ai-watermarks/data/synthid_corpus
test-user 03fb460f77 Track the labeled SynthID corpus; complete metadata-source test coverage
Corpus images were gitignored (local-only). The negatives were reviewed and
cleared for publishing, so the labeled set is now committed (regular git, 65 MB
across 25 files) -- making the removal regression set reproducible and CI-able.

Corpus:
- Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the
  synthetic refs/ calibration fills gitignored.
- Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows
  for removed images pruned; dedup left one cleaned output, f6dd47a5).
- Rewrite the corpus README layout/policy (images committed; review every image
  for private content before adding -- public repo, permanent history).

Test fixtures:
- Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF).
- Add the clean_photo conftest fixture serving a verified-negative image from
  the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it
  (skips if the corpus is absent).

Metadata-source coverage (close the last sub-variant gaps):
- c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and
  compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy).
- exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP
  CreatorTool were already covered).

All 8 metadata-source kinds are now tested at both the unit and identify()
level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:46:47 -07:00

SynthID reference corpus

A locally-collected, labeled image corpus for SynthID work. Two downstream uses:

  1. Per-resolution spectral codebook for an experimental SynthID detector (carrier frequencies are resolution-dependent, so labels must record the exact native resolution).
  2. Removal regression set — verify that our pipeline turns a SynthID-positive image into a negative one.

There is no reliable local detector of the SynthID pixel watermark (Google's decoder is proprietary). The ground-truth label therefore comes from an external oracle, recorded per image in verified_via (see below).

Layout

data/synthid_corpus/
  README.md        # this protocol (committed)
  manifest.csv     # labels + provenance (committed; one row per tracked image)
  images/          # the labeled corpus (committed)
    pos/           # SynthID present
    neg/           # SynthID absent (incl. reviewed real photos)
    cleaned/       # our pipeline output from a pos image
  refs/            # synthetic black/white calibration fills (gitignored, regenerable)

The labeled images are committed so the corpus is reproducible and the removal regression set runs in CI. manifest.csv is kept in sync with the files on disk (one row per tracked image; dangling rows are pruned when files are removed). Before adding any image, confirm it carries no private or identifiable content you would not publish -- this is a public repo and git history is permanent. The synthetic refs/ fills stay gitignored (regenerable, not part of the labeled set).
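The sync rule above can be checked mechanically. A minimal sketch, assuming manifest.csv has a column holding each image's path relative to the corpus root; the column name `path` and the helper `manifest_drift` are hypothetical and not taken from scripts/synthid_corpus.py:

```python
import csv
from pathlib import Path


def manifest_drift(corpus_root, path_column="path"):
    """Return (dangling_rows, untracked_files) for a corpus directory.

    `path_column` is a hypothetical column name holding each image's path
    relative to `corpus_root`; adjust it to the real manifest schema.
    """
    root = Path(corpus_root)
    with open(root / "manifest.csv", newline="") as fh:
        listed = {row[path_column] for row in csv.DictReader(fh)}
    on_disk = {
        p.relative_to(root).as_posix()
        for p in (root / "images").rglob("*")
        if p.is_file()
    }
    # Rows without a file are dangling; files without a row are unlabeled.
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

Both lists should be empty on a clean checkout; a non-empty first list means rows to prune, a non-empty second list means images that still need a manifest row.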

Verification levels (verified_via)

Ground-truth quality, strongest first:

  • gemini-app — checked via the Gemini app "Verify with SynthID" feature. Gold standard for the pixel watermark (Google models).
  • openai-verify — checked via openai.com/verify (gold standard for OpenAI ChatGPT/Codex/API images).
  • synthid-portal — checked via Google's SynthID Detector portal.
  • c2pa-metadata — issuer-only proxy (Google/OpenAI C2PA manifest present). Weaker: the C2PA can be stripped while the pixel watermark remains.
  • third-party — label asserted by an external dataset, not independently verified.
  • none — unverified.

Prefer gemini-app for any image that will train the codebook or gate a test.
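To gate a test or a codebook run on label quality, the ordering above can be encoded directly. A sketch only: the helper names and the default cutoff are illustrative assumptions, and rows are assumed to be manifest rows read as dicts with a `verified_via` key:

```python
# Strongest first, mirroring the list above.
VERIFICATION_ORDER = [
    "gemini-app",
    "openai-verify",
    "synthid-portal",
    "c2pa-metadata",
    "third-party",
    "none",
]


def verification_rank(verified_via):
    """Lower is stronger; unknown values sort after 'none'."""
    try:
        return VERIFICATION_ORDER.index(verified_via)
    except ValueError:
        return len(VERIFICATION_ORDER)


def at_least(rows, weakest_allowed="synthid-portal"):
    """Keep manifest rows verified at `weakest_allowed` or stronger."""
    cutoff = verification_rank(weakest_allowed)
    return [r for r in rows if verification_rank(r["verified_via"]) <= cutoff]
```

Passing `weakest_allowed="gemini-app"` gives the strictest filter, matching the preference stated above.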

What to collect

For the codebook (per target resolution, e.g. 1024x1024, 1024x1536, 1536x2816):

  • 30-50+ SynthID-positive outputs per resolution (more is better; ~150-200 per resolution materially improves carrier discovery).
  • At each target resolution, also a batch of pure-black (#000000) and pure-white (#FFFFFF) fills generated by the SynthID model — these isolate the content-independent carrier (the watermark is most of the signal there).
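The per-resolution target can be tracked with a small helper. A sketch, assuming manifest rows are dicts with `label` and `resolution` keys (assumed names) and taking the lower bound of 30 from the guidance above; `codebook_shortfall` is a hypothetical helper, not part of the ingest script:

```python
from collections import Counter


def codebook_shortfall(rows, targets, minimum=30):
    """Map each target resolution to how many more pos images it needs.

    Resolutions that already have `minimum` pos images are omitted.
    """
    counts = Counter(r["resolution"] for r in rows if r["label"] == "pos")
    return {
        res: minimum - counts[res]
        for res in targets
        if counts[res] < minimum
    }
```

An empty result means every target resolution has reached the minimum; it says nothing about the black/white fills, which are collected separately.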

For the regression set:

  • A handful of pos images plus their cleaned counterparts (run through our pipeline), with each cleaned output re-verified via gemini-app (it should read negative).
  • neg controls: non-AI photos and outputs from non-SynthID models (SD, Midjourney, Firefly) verified negative.

The corpus is committed to a public repo: review every image before adding it and keep out anything private or identifiable you would not publish.

Ingesting

Use scripts/synthid_corpus.py — it copies a file in, records its sha256, resolution, format, and C2PA issuer (via our own detector), and appends a row to manifest.csv:

uv run python scripts/synthid_corpus.py ingest path/to/*.png \
    --label pos --source "Gemini app" --model gemini-3-pro \
    --verified-via gemini-app --notes "1024x1024 batch"

uv run python scripts/synthid_corpus.py status   # counts by label / resolution / verification
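For orientation, the per-file facts recorded at ingest can be derived with the standard library alone. This is an illustrative sketch, not the code in scripts/synthid_corpus.py: it handles PNG only, the `WxH` resolution string is an assumed format, and the C2PA issuer lookup is left out because it depends on our own detector:

```python
import hashlib
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def describe_png(data):
    """Return the sha256 and native resolution of PNG bytes.

    Reads width/height from the IHDR chunk, which the PNG spec requires to
    be the first chunk, so no image library is needed.
    """
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "resolution": f"{width}x{height}",
        "format": "PNG",
    }
```

Hashing the raw bytes rather than decoded pixels means a re-encoded copy of the same picture gets a different sha256, which is the behavior you want when the original bytes are what carries the metadata.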

Autonomous collection via Chrome MCP

Generation can be driven through the browser (the account must be logged in):

  • Gemini (gemini.google.com): type Create an image: <prompt>, wait, hover the result, click the download icon (top-right). Single, reliable click. Outputs carry Google C2PA + SynthID. Occasionally the composer stalls in a "generating" state -> start a New chat to reset.

  • ChatGPT (chatgpt.com): the UI download is flaky (the fullscreen viewer races and can grab the previous image; the share-modal path works but is multi-step). The reliable path is an in-page fetch of the rendered image, which preserves the original bytes (C2PA intact, unlike a canvas re-encode):

    // run in the ChatGPT tab via the browser MCP javascript tool
    (async () => {
      const imgs = [...document.querySelectorAll('img')].filter(i => i.naturalWidth >= 400);
      if (imgs.length === 0) return 'no image found';    // nothing rendered yet; retry later
      const img = imgs[imgs.length - 1];                 // newest large image
      const res = await fetch(img.currentSrc || img.src);
      if (!res.ok) return 'fetch failed: ' + res.status; // status only, never the URL
      const b = await res.blob();                        // original bytes, no canvas re-encode
      const a = document.createElement('a');
      a.href = URL.createObjectURL(b); a.download = 'dl.png';
      document.body.appendChild(a); a.click(); a.remove();
      return 'size=' + b.size;                            // do NOT return the src (privacy guard blocks query strings)
    })()
    

    Gotcha: confirm the returned size differs from the previous image before ingesting -- if the new image has not finished rendering, the script grabs the prior one (the corpus dedups by sha256, but the notes would mislabel it). ChatGPT also shows an A/B "which is better?" picker; click Skip first.
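Comparing sizes catches most stale grabs, but two different images can share a byte count. A stricter check hashes the downloaded bytes before ingesting. A minimal sketch; `is_fresh_download` is a hypothetical helper that would run on the collector side, outside the browser:

```python
import hashlib


def is_fresh_download(data, seen_hashes):
    """True if `data` has not been seen before in this session.

    `seen_hashes` is a set of hex digests that this function updates.
    """
    digest = hashlib.sha256(data).hexdigest()
    fresh = digest not in seen_hashes
    seen_hashes.add(digest)
    return fresh
```

If it returns False, wait for the new image to finish rendering and fetch again instead of ingesting with the new prompt's notes.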

Originals, not previews. Some platforms render a low-res preview in the chat (Grok serves a ~20KB 1024px JPEG/PNG; an in-page <img> fetch grabs that, not the original). Previews are re-encoded and stripped of metadata, so a "clean" preview is not proof that the original is clean. Always pull the original via the platform's native Download / lightbox button and sanity-check the file size (a 20KB "1024x1024" file is a preview). ChatGPT's in-chat <img> is the full-res oaiusercontent original, so fetch+blob is fine there; Grok needs the lightbox Download; Leonardo serves the original JPEG in-chat (its download button returns the same file).
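The file-size sanity check can be expressed as bytes per pixel. A rough sketch: the 0.1 threshold is an illustrative assumption, not a measured cutoff, and `looks_like_preview` is a hypothetical helper:

```python
def looks_like_preview(size_bytes, width, height, min_bytes_per_pixel=0.1):
    """Flag files that are implausibly small for their pixel dimensions.

    `min_bytes_per_pixel` is an illustrative threshold, not a measured one.
    """
    return size_bytes / (width * height) < min_bytes_per_pixel
```

A 20KB file at 1024x1024 works out to roughly 0.02 bytes per pixel and is flagged. Flat images such as the black/white calibration fills also compress to very small files, so treat a hit as a prompt to re-download, not as proof.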

Per-platform watermark map (observed, May 2026)

What each platform actually embeds, verified by byte-scan (and by the Gemini-app oracle where noted). The detectors' coverage is complementary: metadata catches C2PA / IPTC; exif_generator catches EXIF Make/Software + XMP CreatorTool; invisible_watermark.py (imwatermark) catches the open SD/SDXL/FLUX DWT-DCT watermark on pristine files; the visible detector catches the Gemini-family sparkle; the SynthID pixel watermark itself has no local detector (oracle only).

| Platform | C2PA issuer | SynthID pixel | IPTC "Made with AI" | Visible sparkle | imwatermark | Corpus label |
|---|---|---|---|---|---|---|
| Gemini app | Google | yes | - | yes | - | pos |
| ChatGPT / gpt-image | OpenAI | yes | - | - | - | pos |
| Microsoft Designer | OpenAI + Microsoft | yes (via OpenAI) | - | - | - | pos |
| Bing Image Creator | Microsoft (MAI-Image) | no | - | - | - | pos (C2PA "Microsoft", not OpenAI) |
| Google AI Studio (Nano Banana) | none | yes (oracle-confirmed) | - | yes | - | pos (metadata blind spot) |
| Stability AI (Brand Studio) | Stability AI Ltd | no | - | - | no | pos (C2PA only) |
| Ideogram | none | no | - | - | no | pos (EXIF Make="Ideogram AI" only) |
| Meta AI | none | no | yes | - | - | neg (for SynthID) |
| Leonardo.ai | none | no | no | - | no | neg |
| Recraft | none (export strips) | no | no | - | no | neg (re-encoded export, no signal) |
| Krea (FLUX 2 host) | none | no | no | - | no | neg (host omits the imwatermark encoder) |
| Grok (xAI) | none (non-adopter) | no | no | - | no | neg (captured: clean low-res preview) |

Key takeaways:

  • The same model differs by surface: Gemini app wraps C2PA, AI Studio (API/playground) emits none -- only the pixel + sparkle survive.
  • Microsoft Designer's DALL-E backend inherits OpenAI's C2PA+SynthID (issuer "OpenAI, Microsoft"); Bing now runs Microsoft's own MAI-Image and signs C2PA as "Microsoft" (not OpenAI/DALL-E).
  • Meta uses the IPTC digitalSourceType marker, not C2PA or SynthID.
  • The open imwatermark fires only on pristine output from a pipeline that runs the encoder (diffusers default, official BFL) -- not from re-hosts (Krea, Stability hosted SDXL) or re-encoded design exports (Recraft, Canva). Ideogram's only signal is the EXIF Make tag.
  • Bing and Grok web UIs are uncooperative for autonomous capture (no document_idle for screenshots; blob downloads intermittently no-op; low-res in-chat previews). Use their native download button manually if a full-res sample is needed.