Corpus images were gitignored (local-only). The negatives were reviewed and
cleared for publishing, so the labeled set is now committed (regular git, 65 MB
across 25 files) -- making the removal regression set reproducible and CI-able.
Corpus:
- Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the
synthetic refs/ calibration fills gitignored.
- Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows
for removed images pruned; dedup left one cleaned output, f6dd47a5).
- Rewrite the corpus README layout/policy (images committed; review every image
for private content before adding -- public repo, permanent history).
Test fixtures:
- Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF).
- Add the clean_photo conftest fixture serving a verified-negative image from
the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it
(skips if the corpus is absent).
Metadata-source coverage (close the last sub-variant gaps):
- c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and
compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy).
- exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP
CreatorTool were already covered).
All 8 metadata-source kinds are now tested at both the unit and identify()
level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Detect SynthID-bearing images via their C2PA companion: a manifest signed by a
SynthID-using vendor (Google/OpenAI) on AI-generated content implies an
invisible SynthID pixel watermark. Verified end-to-end against the vendor
oracles (openai.com/verify, Gemini "Verify with SynthID").
- metadata: synthid_source() + synthid_watermark verdict in get_ai_metadata,
surfaced as a `metadata --check` callout. Format-agnostic (PNG caBX parser +
JPEG/WebP/AVIF/HEIF/JXL binary scan).
- constants: SYNTHID_C2PA_ISSUERS {Google, OpenAI}; +opened/placed actions.
- c2pa: single CBOR-aware parser (_cbor_text_after) replaces glitchy regex
(fixes fGPT-4o claim_generator); removed duplicate _scan_png_c2pa_chunk from
metadata; shared synthid_verdict / synthid_vendors_in helpers.
- corpus: scripts/synthid_corpus.py ingest tool + data/synthid_corpus/
(manifest tracked, images gitignored) for a labeled reference set.
- tests: +38 across C2PA parser internals, extract/inject round-trip, ISOBMFF
container stripping, all IPTC AI markers, and invisible watermark strength
tiers (SynthID/StableSignature/TreeRing/StegaStamp/RingID/RivaGAN/...).
Pixel-level SynthID detection remains out of reach locally (Google's decoder is
proprietary); a from-scratch spectral pilot confirmed it does not separate real
content. See CLAUDE.md for the full evaluation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CLI with visible, invisible, all, metadata, and batch commands
- Gemini watermark removal via reverse alpha blending
- Invisible watermark removal via diffusion regeneration (SynthID, TreeRing)
- AI metadata stripping (EXIF, PNG text, C2PA)
- Face protection (YOLO/Haar) and analog humanizer
- 137 tests covering all CLI modes and core engines
- Ruff and Pyright clean