remove-ai-watermarks/data/synthid_corpus
test-user ede35a3db5 feat(metadata): read EXIF Make tag; collect Ideogram/Recraft/Krea-FLUX
Collected live samples from three popular generators we lacked:

- Ideogram tags its downloads with EXIF Make="Ideogram AI" (no C2PA, no
  SynthID, no imwatermark) -- the Make tag is its only signal. exif_generator
  only read Software/Artist/ImageDescription, so it missed this; now reads
  Make too. Real cameras put "Apple"/"Canon" in Make (no AI token), so this
  stays low-false-positive. 4 originals ingested.
- Recraft (PNG export) and Krea hosting FLUX 2: downloads carry NO detectable
  signal -- no C2PA/EXIF/IPTC, and notably no imwatermark despite Krea running
  FLUX. identify correctly reports 'unknown'. Both ingested as neg fixtures.

Lesson recorded in CLAUDE.md: the imwatermark detector fires only on pristine
output from a pipeline that runs the encoder (diffusers default, official BFL),
not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 18:38:56 -07:00

SynthID reference corpus

A locally-collected, labeled image corpus for SynthID work. Two downstream uses:

  1. Per-resolution spectral codebook for an experimental SynthID detector (carrier frequencies are resolution-dependent, so labels must record the exact native resolution).
  2. Removal regression set — verify that our pipeline turns a SynthID-positive image into a negative one.

There is no reliable local detector of the SynthID pixel watermark (Google's decoder is proprietary). The ground-truth label therefore comes from an external oracle, recorded per image in verified_via (see below).

Layout

data/synthid_corpus/
  README.md        # this protocol (committed)
  manifest.csv     # labels + provenance (committed, reviewable)
  images/          # the actual files (gitignored, local-only)
    pos/           # SynthID present
    neg/           # SynthID absent
    cleaned/       # our pipeline output from a pos image

Images are gitignored on purpose: the corpus is large, may contain personal or licensed content, and SynthID-positive outputs are best kept local. The manifest.csv (sha256 + labels + extracted metadata) is the durable artifact.

Verification levels (verified_via)

Ground-truth quality, strongest first:

  • gemini-app — checked via the Gemini app "Verify with SynthID" feature. Gold standard for the pixel watermark (Google models).
  • openai-verify — checked via openai.com/verify (gold standard for OpenAI ChatGPT/Codex/API images).
  • synthid-portal — checked via Google's SynthID Detector portal.
  • c2pa-metadata — issuer-only proxy (Google/OpenAI C2PA manifest present). Weaker: the C2PA can be stripped while the pixel watermark remains.
  • third-party — label asserted by an external dataset, not independently verified.
  • none — unverified.

Prefer gemini-app for any image that will train the codebook or gate a test.
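The ordering above can be encoded so that tooling refuses weak labels. This is a minimal sketch, not the project's code: `VERIFICATION_RANK`, `ORACLE_LEVELS`, `is_oracle_verified`, and `strongest` are hypothetical names, and treating the three first-party checks as "oracle" levels is an assumption drawn from the list.

```python
# Rank of each verified_via level, strongest first (order taken from the list above).
VERIFICATION_RANK = {
    "gemini-app": 0,
    "openai-verify": 1,
    "synthid-portal": 2,
    "c2pa-metadata": 3,
    "third-party": 4,
    "none": 5,
}

# Assumption: only direct first-party checks count as oracle-grade ground truth.
ORACLE_LEVELS = {"gemini-app", "openai-verify", "synthid-portal"}


def is_oracle_verified(verified_via: str) -> bool:
    """True if the label came from a first-party oracle rather than a proxy."""
    return verified_via in ORACLE_LEVELS


def strongest(levels):
    """Return the strongest level among several recorded for one image.

    Raises KeyError on a level that is not in the list above.
    """
    return min(levels, key=VERIFICATION_RANK.__getitem__)
```

A codebook or test gate could then skip any row where `is_oracle_verified(row["verified_via"])` is false, or apply the stricter `== "gemini-app"` check if the preference above is meant as a hard rule.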

What to collect

For the codebook (per target resolution, e.g. 1024x1024, 1024x1536, 1536x2816):

  • 30-50+ SynthID-positive outputs per resolution (more is better; ~150-200 per resolution materially improves carrier discovery).
  • At each target resolution, also a batch of pure-black (#000000) and pure-white (#FFFFFF) fills generated by the SynthID model — these isolate the content-independent carrier (the watermark is most of the signal there).
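One way to use the flat fills is to average the residual (pixel minus fill value) over a batch at a single resolution, so per-image noise cancels while any content-independent pattern remains. The sketch below is generic averaging on nested lists and makes no claim about how SynthID embeds its signal; `mean_residual` is a hypothetical helper, and a real pipeline would likely load images with an imaging library and follow this with a spectral transform.

```python
def mean_residual(fills, fill_value):
    """Average (pixel - fill_value) across a batch of same-sized grayscale images.

    fills: list of images, each a list of rows of numeric pixel values.
    fill_value: the nominal fill level (0 for pure black, 255 for pure white).
    """
    if not fills:
        raise ValueError("need at least one fill image")
    height, width = len(fills[0]), len(fills[0][0])
    acc = [[0.0] * width for _ in range(height)]
    for img in fills:
        # Carriers are resolution-dependent, so never mix resolutions in one batch.
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("all fills must share one resolution")
        for y, row in enumerate(img):
            for x, value in enumerate(row):
                acc[y][x] += value - fill_value
    n = len(fills)
    return [[v / n for v in row] for row in acc]
```

For example, `mean_residual([[[0, 2], [1, 0]], [[0, 4], [1, 0]]], 0)` returns `[[0.0, 3.0], [1.0, 0.0]]`.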

For the regression set:

  • A handful of pos images plus their cleaned counterparts (run through our pipeline), with each cleaned image re-verified via gemini-app (should read negative).
  • neg controls: non-AI photos and outputs from non-SynthID models (SD, Midjourney, Firefly) verified negative.

Avoid personal or identifiable content; the corpus stays local.

Ingesting

Use scripts/synthid_corpus.py — it copies a file in, records its sha256, resolution, format, and C2PA issuer (via our own detector), and appends a row to manifest.csv:

uv run python scripts/synthid_corpus.py ingest path/to/*.png \
    --label pos --source "Gemini app" --model gemini-3-pro \
    --verified-via gemini-app --notes "1024x1024 batch"

uv run python scripts/synthid_corpus.py status   # counts by label / resolution / verification
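For orientation, the core of an ingest step (hash, dedup by sha256, copy, append a row) can be sketched with the standard library alone. This is not the actual scripts/synthid_corpus.py: the column names in `FIELDS`, the destination filename scheme, and the `ingest` signature are all hypothetical, and the resolution, format, and C2PA issuer extraction that the real script performs are left out.

```python
import csv
import hashlib
import shutil
from pathlib import Path

# Hypothetical column names; the real manifest.csv schema may differ.
FIELDS = ["sha256", "label", "source", "model", "verified_via", "notes"]


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large images are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest(src: Path, corpus: Path, label: str, **meta: str) -> bool:
    """Copy src into corpus/images/<label>/ and append a manifest row.

    Returns False, writing nothing, if the sha256 is already in the manifest.
    """
    manifest = corpus / "manifest.csv"
    sha = sha256_of(src)
    if manifest.exists():
        with manifest.open(newline="") as fh:
            if any(row["sha256"] == sha for row in csv.DictReader(fh)):
                return False
    dest_dir = corpus / "images" / label
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest_dir / f"{sha[:16]}{src.suffix}")
    write_header = not manifest.exists()
    with manifest.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({"sha256": sha, "label": label,
                         **{k: meta.get(k, "") for k in FIELDS[2:]}})
    return True
```

Keying the dedup on the content hash rather than the filename means re-downloading the same image under a new name cannot create a second row.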

Autonomous collection via Chrome MCP

Generation can be driven through the browser (the account must be logged in):

  • Gemini (gemini.google.com): type Create an image: <prompt>, wait, hover the result, click the download icon (top-right). Single, reliable click. Outputs carry Google C2PA + SynthID. Occasionally the composer stalls in a "generating" state -> start a New chat to reset.

  • ChatGPT (chatgpt.com): the UI download is flaky (the fullscreen viewer races and can grab the previous image; the share-modal path works but is multi-step). Reliable path is an in-page fetch of the rendered image, which preserves the original bytes (C2PA intact, unlike a canvas re-encode):

    // run in the ChatGPT tab via the browser MCP javascript tool
    (async () => {
      const imgs = [...document.querySelectorAll('img')].filter(i => i.naturalWidth >= 400);
      const img = imgs[imgs.length - 1];                 // newest large image
      if (!img) return 'no large image found';           // nothing rendered yet; avoid a TypeError below
      const b = await (await fetch(img.currentSrc || img.src)).blob();
      const a = document.createElement('a');
      a.href = URL.createObjectURL(b); a.download = 'dl.png';
      document.body.appendChild(a); a.click(); a.remove();
      return 'size=' + b.size;                            // do NOT return the src (privacy guard blocks query strings)
    })()
    

    Gotcha: confirm the returned size differs from the previous image before ingesting -- if the new image has not finished rendering, the script grabs the prior one (the corpus dedups by sha256, but the notes would mislabel it). ChatGPT also shows an A/B "which is better?" picker; click Skip first.
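The size comparison described in the gotcha is easy to automate on the driver side. A minimal sketch, assuming the driver receives the script's `size=<bytes>` string and remembers the previous download's size; `parse_size` and `is_new_image` are hypothetical helper names.

```python
import re


def parse_size(result: str) -> int:
    """Extract the byte count from a 'size=<n>' string returned by the page script."""
    match = re.fullmatch(r"size=(\d+)", result.strip())
    if match is None:
        raise ValueError(f"unexpected script result: {result!r}")
    return int(match.group(1))


def is_new_image(result: str, previous_size):
    """True if the reported size differs from the previous download (or there was none)."""
    return previous_size is None or parse_size(result) != previous_size
```

Two distinct images can in principle share a byte count, so a False here means "wait and retry", not "definitely stale".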

Originals, not previews. Some platforms render a low-res preview in the chat (Grok serves a ~20KB 1024px JPEG/PNG; the in-page <img> fetch grabs that, not the original). Previews are re-encoded and stripped of metadata, so a "clean" preview is not proof the original is clean. Always pull the original via the platform's native Download / lightbox button and sanity-check the file size (a 20KB "1024x1024" is a preview). ChatGPT's in-chat <img> is the full-res oaiusercontent original, so fetch+blob is fine there; Grok needs the lightbox Download; Leonardo serves the original JPEG in-chat (download button matches).
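The file-size sanity check can be expressed as bytes per pixel. In the sketch below the 0.05 cutoff is an assumption chosen only so that the 20KB 1024x1024 case above is flagged; it is not a measured threshold, and `looks_like_preview` is a hypothetical helper.

```python
# Assumed cutoff: a ~20KB 1024x1024 file is about 0.02 bytes/pixel, so 0.05 flags it.
MIN_BYTES_PER_PIXEL = 0.05


def bytes_per_pixel(size_bytes: int, width: int, height: int) -> float:
    """File size divided by pixel count; very low values suggest heavy re-encoding."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return size_bytes / (width * height)


def looks_like_preview(size_bytes: int, width: int, height: int,
                       threshold: float = MIN_BYTES_PER_PIXEL) -> bool:
    """Heuristic only: flag files that are suspiciously small for their resolution."""
    return bytes_per_pixel(size_bytes, width, height) < threshold
```

A flagged file is a prompt to re-download via the native button, not proof of a preview: flat or low-detail originals also compress to very few bytes.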

Per-platform watermark map (observed, May 2026)

What each platform actually embeds, verified by byte-scan (and Gemini-app oracle where noted). The local detectors are complementary: the metadata detector catches C2PA / IPTC; the visible detector catches the Gemini-family sparkle; the SynthID pixel watermark itself has no local detector (oracle only).

| Platform | C2PA issuer | SynthID pixel | IPTC "Made with AI" | Visible sparkle | Corpus label |
|---|---|---|---|---|---|
| Gemini app | Google | yes | - | yes | pos |
| ChatGPT / gpt-image | OpenAI | yes | - | - | pos |
| Microsoft Designer | OpenAI + Microsoft | yes (via OpenAI) | - | - | pos |
| Google AI Studio (Nano Banana) | none | yes (oracle-confirmed) | - | yes | pos (metadata blind spot) |
| Meta AI | none | no | yes | - | neg (for SynthID) |
| Leonardo.ai | none | no | no | - | neg |
| Grok (xAI) | none (non-adopter) | no | no | - | neg (captured: clean low-res preview) |
| Bing Image Creator (DALL-E) | (expect OpenAI, like Designer) | - | - | - | not captured (UI uncooperative) |

Key takeaways:

  • The same model differs by surface: Gemini app wraps C2PA, AI Studio (API/playground) emits none -- only the pixel + sparkle survive.
  • Microsoft Designer's DALL-E backend inherits OpenAI's C2PA+SynthID (issuer "OpenAI, Microsoft").
  • Meta uses the IPTC digitalSourceType marker, not C2PA or SynthID.
  • Bing and Grok web UIs are uncooperative for autonomous capture (no document_idle for screenshots; blob downloads intermittently no-op; low-res in-chat previews). Use their native download button manually if a full-res sample is needed.
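The map above can also be held as data, which makes the "metadata blind spot" mechanically checkable. This is a sketch under assumptions: the `Signals` type and helper names are hypothetical, `-` cells are recorded as False (not observed), and Bing is omitted because no sample was captured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signals:
    c2pa_issuer: Optional[str]   # None when no C2PA manifest was found
    synthid_pixel: bool          # per the external oracle, not a local detector
    iptc_ai: bool                # IPTC "Made with AI" marker
    visible_sparkle: bool        # Gemini-family visible mark


# Values transcribed from the observed map; "-" cells are recorded as False.
PLATFORMS = {
    "Gemini app": Signals("Google", True, False, True),
    "ChatGPT / gpt-image": Signals("OpenAI", True, False, False),
    "Microsoft Designer": Signals("OpenAI, Microsoft", True, False, False),
    "Google AI Studio (Nano Banana)": Signals(None, True, False, True),
    "Meta AI": Signals(None, False, True, False),
    "Leonardo.ai": Signals(None, False, False, False),
    "Grok (xAI)": Signals(None, False, False, False),
}


def local_detectors(s: Signals) -> list:
    """Which local detectors would have something to find on this platform's output."""
    hits = []
    if s.c2pa_issuer or s.iptc_ai:
        hits.append("metadata")
    if s.visible_sparkle:
        hits.append("visible")
    return hits


def is_metadata_blind_spot(s: Signals) -> bool:
    """SynthID pixel present but nothing for the metadata detector to read."""
    return s.synthid_pixel and "metadata" not in local_detectors(s)


blind = [name for name, s in PLATFORMS.items() if is_metadata_blind_spot(s)]
```

With the rows as transcribed, `blind` contains only the AI Studio entry, matching the "metadata blind spot" note in the map.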