Back docs/synthid.md section 2.2 with the actual test set: the per-image
oracle-verified subjects were only in a local working dir, while the doc claimed
they were recorded in data/synthid_corpus/. Ingest the key pos+cleaned pairs so
the claim holds.
- pos: openai_1/2/3 originals (gpt-image, openai-verify) + gemini_1/2/3/4
originals (Gemini app, gemini-app); all probe as C2PA-SynthID present.
- cleaned: OpenAI at strength 0.05 (for openai_2, only the s010 output was
captured) + Gemini at 0.15 --max-resolution 1536; oracle: SynthID NOT detected.
Metadata stripped, so no C2PA on the cleaned rows.
- Excluded the third-party issue #14 image (pic3): oracle-verified but not
committed to the public corpus.
- docs/synthid.md 2.2: state that OpenAI n=4 is 3 archived + 1 external-only.
- CLAUDE.md: drop the drift-prone "~65 MB" corpus size from the sdist note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Corpus images were gitignored (local-only). The negatives were reviewed and
cleared for publishing, so the labeled set is now committed (regular git, 65 MB
across 25 files) -- making the removal regression set reproducible and CI-able.
Corpus:
- Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the
synthetic refs/ calibration fills gitignored.
- Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows
for removed images pruned; dedup left one cleaned output, f6dd47a5).
- Rewrite the corpus README layout/policy (images committed; review every image
for private content before adding -- public repo, permanent history).
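The reconcile step above can be sketched roughly as follows. This is a minimal
illustration, not the code in scripts/synthid_corpus.py: the function name
reconcile_manifest and the "path" column are assumptions made for the example.

```python
import csv
from pathlib import Path


def reconcile_manifest(manifest_path, images_dir, path_column="path"):
    """Drop manifest rows whose image file is no longer on disk.

    `path_column` is a hypothetical column holding a path relative to
    `images_dir`; the real manifest schema may differ.
    Returns (rows_before, rows_after).
    """
    manifest_path = Path(manifest_path)
    images_dir = Path(images_dir)

    with manifest_path.open(newline="") as fh:
        reader = csv.DictReader(fh)
        fieldnames = reader.fieldnames or []
        rows = list(reader)

    # Keep only rows that still point at an existing file.
    kept = [
        row for row in rows
        if row.get(path_column) and (images_dir / row[path_column]).is_file()
    ]

    with manifest_path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)

    return len(rows), len(kept)
```

Returning the before/after counts makes a prune such as 117 -> 25 easy to log
and to assert on in CI.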
Test fixtures:
- Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF).
- Add the clean_photo conftest fixture serving a verified-negative image from
the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it
(skips if the corpus is absent).
Metadata-source coverage (close the last sub-variant gaps):
- c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and
compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy).
- exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP
CreatorTool were already covered).
All 8 metadata-source kinds are now tested at both the unit and identify()
level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture).
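A possible shape for the lookup behind the clean_photo fixture, as a sketch
only. The helper name find_clean_photo, the images/neg layout and the suffix
list are assumptions for illustration; the real conftest may select the image
differently.

```python
from pathlib import Path
from typing import Optional

# Hypothetical set of extensions treated as corpus images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp", ".heic"}


def find_clean_photo(corpus_root) -> Optional[Path]:
    """Return one verified-negative image, or None if the corpus is absent.

    Assumes negatives live under <corpus_root>/images/neg. A pytest fixture
    wrapping this would call pytest.skip(...) when None is returned.
    """
    neg_dir = Path(corpus_root) / "images" / "neg"
    if not neg_dir.is_dir():
        return None
    candidates = sorted(
        p for p in neg_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    # Sorting keeps the choice deterministic across runs.
    return candidates[0] if candidates else None
```

Picking the first sorted file keeps the three repointed tests stable as long as
the neg/ set is not renamed.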
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add manifest row for the 4ef377bd -> f6dd47a5 chain: a gpt-image-2 sample
(openai.com/verify: SynthID + C2PA detected) cleaned via v0.5.3 `all` at
native 1254x1254 (prod-equivalent SDXL base, strength 0.05, 50 steps).
openai.com/verify reports SynthID NOT detected after the run, re-confirming
that the #10 native-resolution default defeats OpenAI SynthID and resolving
the #15 root cause (older SD-1.5/768px downscale default did not).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collected live samples from three popular generators we lacked:
- Ideogram tags its downloads with EXIF Make="Ideogram AI" (no C2PA, no
SynthID, no imwatermark) -- the Make tag is its only signal. exif_generator
only read Software/Artist/ImageDescription, so it missed this; now reads
Make too. Real cameras put "Apple"/"Canon" in Make (no AI token), so this
stays low-false-positive. 4 originals ingested.
- Recraft (PNG export) and Krea hosting FLUX 2: downloads carry NO detectable
signal -- no C2PA/EXIF/IPTC, and notably no imwatermark despite Krea running
FLUX. identify correctly reports 'unknown'. Both ingested as neg fixtures.
Lesson recorded in CLAUDE.md: the imwatermark detector fires only on pristine
output from a pipeline that runs the encoder (diffusers default, official BFL),
not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva).
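The Make-tag check described above, sketched under stated assumptions: the
function name match_exif_generator and the token list are illustrative and are
not the project's actual detector or vocabulary.

```python
# Illustrative tokens only; the real detector's list is not shown here.
AI_TOKENS = ("ideogram", "midjourney", "dall-e", "stable diffusion")

# EXIF fields consulted, in order. Make is the field added for Ideogram.
EXIF_FIELDS = ("Software", "Artist", "ImageDescription", "Make")


def match_exif_generator(tags):
    """Return (field, value) for the first EXIF field naming an AI generator.

    `tags` is a plain dict of already-decoded EXIF tag names to values.
    Returns None when no field contains a known token.
    """
    for field in EXIF_FIELDS:
        value = tags.get(field)
        if not isinstance(value, str):
            continue
        lowered = value.lower()
        if any(token in lowered for token in AI_TOKENS):
            return field, value
    return None


# Camera vendors carry no AI token in Make, so they do not match.
print(match_exif_generator({"Make": "Ideogram AI"}))
print(match_exif_generator({"Make": "Apple", "Software": "17.4"}))
```

Matching on tokens rather than on the mere presence of Make is what keeps real
camera output out of the positives.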
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collected live C2PA positives from Bing Image Creator and Stability Brand
Studio (DreamStudio successor) and learned two things our scan got wrong:
- Bing now runs Microsoft's own MAI-Image model, not DALL-E, and signs
C2PA as 'Microsoft'. The scan caught it, but the platform label claimed
'Microsoft Designer (DALL-E / OpenAI backend)'. Relabeled model-neutral:
'Microsoft (Bing Image Creator / Designer)'.
- Stability signs C2PA as 'Stability AI' (cert 'Stability AI Ltd'), which
was not in C2PA_ISSUERS, so it read as 'unknown signer'. Added the issuer
and a platform mapping. Stability uses no SynthID and (on its current
Stable Image model) no imwatermark -- verified, both negative.
Both ingested as SynthID-negative corpus fixtures (they are AI but not
SynthID) for issuer-coverage. Canva skipped: its downloads are re-encoded
design exports that strip C2PA, so a Canva sample would be inconclusive.
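A minimal sketch of the issuer-to-platform lookup this change touches. The
dict name C2PA_ISSUER_PLATFORMS, the function name, the substring match and
the Stability label are assumptions; only the Microsoft label and the
'unknown signer' fallback are taken from the text above.

```python
# Hypothetical mapping from a C2PA signer name to a display label.
C2PA_ISSUER_PLATFORMS = {
    "Microsoft": "Microsoft (Bing Image Creator / Designer)",
    "Stability AI": "Stability AI",  # label assumed for illustration
}


def platform_for_issuer(issuer):
    """Map a signer string to a platform label, model-neutral by design.

    Substring matching lets a certificate name such as 'Stability AI Ltd'
    resolve to the 'Stability AI' entry.
    """
    needle = (issuer or "").lower()
    for known, platform in C2PA_ISSUER_PLATFORMS.items():
        if known.lower() in needle:
            return platform
    return "unknown signer"
```

Keeping the label free of a backend model name avoids the stale
'DALL-E / OpenAI backend' claim when a vendor swaps models.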
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Grow the SynthID corpus to 109 originals (91 iPhone-photo negatives,
2 positives) and document what was learned studying 8 platforms:
- README: per-platform watermark map (C2PA issuer / SynthID pixel / IPTC
/ visible sparkle per platform) and an "originals, not previews" note
(re-encoded previews strip metadata, so a clean preview is not proof).
- CLAUDE.md: surface-dependent blind spot -- the same Google model wraps
C2PA in the Gemini app but emits the SynthID pixel watermark + sparkle
with no C2PA/IPTC via the API/playground (AI Studio, Nano Banana), so
synthid_source returns None despite SynthID being present; only the
pixel oracle or the visible-sparkle detector catches those.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds content positives (OpenAI gpt-image: forest, fisherman, tokyo; Google
gemini: fisherman, mug) and SDXL/non-SynthID negatives to the local corpus
manifest. Now spans 4 resolutions across 2 vendors (was solid-black only).
README: documents driving generation via Chrome MCP -- Gemini single-click
download; ChatGPT via in-page fetch+blob (preserves original C2PA bytes,
unlike the flaky UI download / a canvas re-encode).
Images stay gitignored; only the manifest (sha256 + labels + extracted
metadata) and protocol are tracked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Detect SynthID-bearing images via their C2PA companion: a manifest signed by a
SynthID-using vendor (Google/OpenAI) on AI-generated content implies an
invisible SynthID pixel watermark. Verified end-to-end against the vendor
oracles (openai.com/verify, Gemini "Verify with SynthID").
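The inference described above, as a rough sketch. The function name
infer_synthid_vendor, its signature and the AI_SOURCE_TYPES set are
assumptions for illustration and do not reproduce synthid_source(); the
source-type names are IPTC digitalSourceType terms.

```python
# Vendors known to pair C2PA manifests with a SynthID pixel watermark.
SYNTHID_C2PA_ISSUERS = {"Google", "OpenAI"}

# Assumed set of digitalSourceType terms that mean "AI-generated".
AI_SOURCE_TYPES = {
    "trainedAlgorithmicMedia",
    "compositeWithTrainedAlgorithmicMedia",
}


def infer_synthid_vendor(issuer, digital_source_type):
    """Return the vendor whose SynthID watermark is implied, else None.

    Both conditions must hold: a SynthID-using signer AND an AI-generated
    source type. This is a metadata proxy, not a pixel-level detection.
    """
    if issuer not in SYNTHID_C2PA_ISSUERS:
        return None
    # The field is usually an IPTC URI; compare on its last path segment.
    kind = (digital_source_type or "").rstrip("/").rsplit("/", 1)[-1]
    if kind not in AI_SOURCE_TYPES:
        return None
    return issuer
```

Because the verdict rests on the manifest alone, stripping metadata removes the
signal even when the pixel watermark survives.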
- metadata: synthid_source() + synthid_watermark verdict in get_ai_metadata,
surfaced as a `metadata --check` callout. Format-agnostic (PNG caBX parser +
JPEG/WebP/AVIF/HEIF/JXL binary scan).
- constants: SYNTHID_C2PA_ISSUERS {Google, OpenAI}; +opened/placed actions.
- c2pa: single CBOR-aware parser (_cbor_text_after) replaces glitchy regex
(fixes the claim_generator misread as "fGPT-4o"); removed duplicate
_scan_png_c2pa_chunk from metadata; shared synthid_verdict / synthid_vendors_in
helpers.
- corpus: scripts/synthid_corpus.py ingest tool + data/synthid_corpus/
(manifest tracked, images gitignored) for a labeled reference set.
- tests: +38 across C2PA parser internals, extract/inject round-trip, ISOBMFF
container stripping, all IPTC AI markers, and invisible watermark strength
tiers (SynthID/StableSignature/TreeRing/StegaStamp/RingID/RivaGAN/...).
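For illustration, a CBOR-aware read of the text string following a key might
look like the sketch below. It is not the project's _cbor_text_after; the name
cbor_text_after and its signature are assumptions. It also shows one plausible
source of a stray leading "f": the header byte of a 6-character CBOR text
string is 0x66, which is ASCII "f".

```python
def cbor_text_after(data: bytes, key: bytes):
    """Return the CBOR text string that directly follows `key`, or None.

    Decodes the CBOR header (major type 3 = text string) instead of
    grabbing a run of printable bytes, so the header byte is never
    mistaken for content.
    """
    pos = data.find(key)
    if pos < 0:
        return None
    i = pos + len(key)
    if i >= len(data):
        return None

    head = data[i]
    if head >> 5 != 3:  # not a text string
        return None
    info = head & 0x1F
    i += 1

    if info < 24:
        length = info  # length packed into the header byte
    elif info in (24, 25, 26):
        width = {24: 1, 25: 2, 26: 4}[info]
        if i + width > len(data):
            return None
        length = int.from_bytes(data[i:i + width], "big")
        i += width
    else:
        return None  # 64-bit and indefinite lengths are not handled here

    raw = data[i:i + length]
    if len(raw) != length:
        return None  # truncated payload
    return raw.decode("utf-8", errors="replace")


# 0x6f = text(15) "claim_generator", 0x66 = text(6) "GPT-4o"
sample = b"\x6fclaim_generator" + b"\x66GPT-4o"
print(cbor_text_after(sample, b"claim_generator"))
```

A printable-run regex over the same bytes would start at 0x66 and include it in
the match, which is why the length-aware read is the safer choice.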
Pixel-level SynthID detection remains out of reach locally (Google's decoder is
proprietary); a from-scratch spectral pilot confirmed that a spectral signal does
not separate watermarked from unwatermarked images on real content. See CLAUDE.md
for the full evaluation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>