remove-ai-watermarks

mirror of https://github.com/wiltodelta/remove-ai-watermarks.git synced 2026-07-24 08:30:50 +02:00

Author	SHA1	Message	Date
Victor Kuznetsov	8523f48fb6	data(corpus): archive June 2026 SynthID strength-study subjects Back docs/synthid.md section 2.2 with the actual test set: the per-image oracle-verified subjects were only in a local working dir, while the doc claimed they were recorded in data/synthid_corpus/. Ingest the key pos+cleaned pairs so the claim holds. - pos: openai_1/2/3 originals (gpt-image, openai-verify) + gemini_1/2/3/4 originals (Gemini app, gemini-app); all probe as C2PA-SynthID present. - cleaned: OpenAI at strength 0.05 (openai_2 only s010 captured) + Gemini at 0.15 --max-resolution 1536; oracle: SynthID NOT detected. Metadata stripped, so no C2PA on the cleaned rows. - Excluded the third-party issue #14 image (pic3): oracle-verified but not committed to the public corpus. - docs/synthid.md 2.2: state OpenAI n=4 = 3 archived + 1 external-only. - CLAUDE.md: drop the drift-prone "~65 MB" corpus size from the sdist note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 17:09:58 -07:00
test-user	03fb460f77	Track the labeled SynthID corpus; complete metadata-source test coverage Corpus images were gitignored (local-only). The negatives were reviewed and cleared for publishing, so the labeled set is now committed (regular git, 65 MB across 25 files) -- making the removal regression set reproducible and CI-able. Corpus: - Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the synthetic refs/ calibration fills gitignored. - Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows for removed images pruned; dedup left one cleaned output, f6dd47a5). - Rewrite the corpus README layout/policy (images committed; review every image for private content before adding -- public repo, permanent history). Test fixtures: - Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF). - Add the clean_photo conftest fixture serving a verified-negative image from the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it (skips if the corpus is absent). Metadata-source coverage (close the last sub-variant gaps): - c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy). - exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP CreatorTool were already covered). All 8 metadata-source kinds are now tested at both the unit and identify() level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:46:47 -07:00
test-user	59d72c5db7	Record verified gpt-image-2 SynthID-cleaned chain in corpus Add manifest row for the 4ef377bd -> f6dd47a5 chain: a gpt-image-2 sample (openai.com/verify: SynthID + C2PA detected) cleaned via v0.5.3 `all` at native 1254x1254 (prod-equivalent SDXL base, strength 0.05, 50 steps). openai.com/verify reports SynthID NOT detected after the run, re-confirming that the #10 native-resolution default defeats OpenAI SynthID and resolving the #15 root cause (older SD-1.5/768px downscale default did not). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 13:55:10 -07:00
test-user	93c664f7fb	docs: sync README + corpus map with v0.5.x detection coverage - README Features: add the identify / provenance-detection capability. - README Supported models: add FLUX, Stability AI, Microsoft/Bing (MAI-Image), Meta AI rows; note SD/SDXL/FLUX imwatermark is locally detectable; add a detection note pointing at identify. - corpus README per-platform map: add Stability / Ideogram / Recraft / Krea-FLUX rows + an imwatermark column; correct Bing (MAI-Image, signs 'Microsoft'); note imwatermark fires only on pristine pipeline output, not re-hosts/exports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 09:32:03 -07:00
test-user	ede35a3db5	feat(metadata): read EXIF Make tag; collect Ideogram/Recraft/Krea-FLUX Collected live samples from three popular generators we lacked: - Ideogram tags its downloads with EXIF Make="Ideogram AI" (no C2PA, no SynthID, no imwatermark) -- the Make tag is its only signal. exif_generator only read Software/Artist/ImageDescription, so it missed this; now reads Make too. Real cameras put "Apple"/"Canon" in Make (no AI token), so this stays low-false-positive. 4 originals ingested. - Recraft (PNG export) and Krea hosting FLUX 2: downloads carry NO detectable signal -- no C2PA/EXIF/IPTC, and notably no imwatermark despite Krea running FLUX. identify correctly reports 'unknown'. Both ingested as neg fixtures. Lesson recorded in CLAUDE.md: the imwatermark detector fires only on pristine output from a pipeline that runs the encoder (diffusers default, official BFL), not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 18:38:56 -07:00
test-user	3a1c5427c8	feat(c2pa): recognize Stability AI issuer; fix Microsoft platform label Collected live C2PA positives from Bing Image Creator and Stability Brand Studio (DreamStudio successor) and learned two things our scan got wrong: - Bing now runs Microsoft's own MAI-Image model, not DALL-E, and signs C2PA as 'Microsoft'. The scan caught it, but the platform label claimed 'Microsoft Designer (DALL-E / OpenAI backend)'. Relabeled model-neutral: 'Microsoft (Bing Image Creator / Designer)'. - Stability signs C2PA as 'Stability AI' (cert 'Stability AI Ltd'), which was not in C2PA_ISSUERS, so it read as 'unknown signer'. Added the issuer and a platform mapping. Stability uses no SynthID and (on its current Stable Image model) no imwatermark watermark -- verified, both negative. Both ingested as SynthID-negative corpus fixtures (they are AI but not SynthID) for issuer-coverage. Canva skipped: its downloads are re-encoded design exports that strip C2PA, so a Canva sample would be inconclusive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 17:12:42 -07:00
test-user	af787fd8d6	docs(corpus): per-platform watermark map + surface-dependent blind spot Grow the SynthID corpus to 109 originals (91 iPhone-photo negatives, 2 positives) and document what was learned studying 8 platforms: - README: per-platform watermark map (C2PA issuer / SynthID pixel / IPTC / visible sparkle per platform) and an "originals, not previews" note (re-encoded previews strip metadata, so a clean preview is not proof). - CLAUDE.md: surface-dependent blind spot -- the same Google model wraps C2PA in the Gemini app but emits the SynthID pixel watermark + sparkle with no C2PA/IPTC via the API/playground (AI Studio, Nano Banana), so synthid_source returns None despite SynthID being present; only the pixel oracle or the visible-sparkle detector catches those. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 15:55:17 -07:00
test-user	da0edcbddc	chore(corpus): grow SynthID reference set + document autonomous Chrome collection Adds content positives (OpenAI gpt-image: forest, fisherman, tokyo; Google gemini: fisherman, mug) and SDXL/non-SynthID negatives to the local corpus manifest. Now spans 4 resolutions across 2 vendors (was solid-black only). README: documents driving generation via Chrome MCP -- Gemini single-click download; ChatGPT via in-page fetch+blob (preserves original C2PA bytes, unlike the flaky UI download / a canvas re-encode). Images stay gitignored; only the manifest (sha256 + labels + extracted metadata) and protocol are tracked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:46:46 -07:00
test-user	f07ce10c72	feat(metadata): SynthID-source detection, C2PA parser consolidation, corpus + tests Detect SynthID-bearing images via their C2PA companion: a manifest signed by a SynthID-using vendor (Google/OpenAI) on AI-generated content implies an invisible SynthID pixel watermark. Verified end-to-end against the vendor oracles (openai.com/verify, Gemini "Verify with SynthID"). - metadata: synthid_source() + synthid_watermark verdict in get_ai_metadata, surfaced as a `metadata --check` callout. Format-agnostic (PNG caBX parser + JPEG/WebP/AVIF/HEIF/JXL binary scan). - constants: SYNTHID_C2PA_ISSUERS {Google, OpenAI}; +opened/placed actions. - c2pa: single CBOR-aware parser (_cbor_text_after) replaces glitchy regex (fixes fGPT-4o claim_generator); removed duplicate _scan_png_c2pa_chunk from metadata; shared synthid_verdict / synthid_vendors_in helpers. - corpus: scripts/synthid_corpus.py ingest tool + data/synthid_corpus/ (manifest tracked, images gitignored) for a labeled reference set. - tests: +38 across C2PA parser internals, extract/inject round-trip, ISOBMFF container stripping, all IPTC AI markers, and invisible watermark strength tiers (SynthID/StableSignature/TreeRing/StegaStamp/RingID/RivaGAN/...). Pixel-level SynthID detection remains out of reach locally (Google's decoder is proprietary); a from-scratch spectral pilot confirmed it does not separate real content. See CLAUDE.md for the full evaluation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 11:32:46 -07:00
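
Note on f07ce10c72: the message describes inferring SynthID from the C2PA companion, where a manifest signed by a SynthID-using vendor on AI-generated content implies an invisible pixel watermark. A minimal sketch of that rule follows. It is not the repository's code. SYNTHID_C2PA_ISSUERS and the digitalSourceType terms come from the log above or from the IPTC vocabulary, while the function name infer_synthid_vendor, its parameters, and the substring issuer match are assumptions made for illustration.

```python
from typing import Optional

# Issuer set named in the commit message (constants: SYNTHID_C2PA_ISSUERS).
SYNTHID_C2PA_ISSUERS = frozenset({"Google", "OpenAI"})

# IPTC digitalSourceType terms treated as AI-generated in this sketch.
# "algorithmicMedia" is procedural and, per the log, is not flagged as AI.
AI_SOURCE_TYPES = frozenset({
    "trainedAlgorithmicMedia",
    "compositeWithTrainedAlgorithmicMedia",
})


def infer_synthid_vendor(issuer: Optional[str],
                         digital_source_type: Optional[str]) -> Optional[str]:
    """Hypothetical helper: return the vendor whose SynthID watermark is
    implied by the C2PA metadata, or None when nothing can be inferred."""
    if not issuer or not digital_source_type:
        return None
    # Accept either the full IPTC URI or the bare term at its end.
    term = digital_source_type.rstrip("/").rsplit("/", 1)[-1]
    if term not in AI_SOURCE_TYPES:
        return None
    # Assumption: a case-insensitive substring match on the signer name.
    for vendor in sorted(SYNTHID_C2PA_ISSUERS):
        if vendor.lower() in issuer.lower():
            return vendor
    return None
```

A None result only means the metadata gives no evidence. As af787fd8d6 records, API and playground outputs can carry the SynthID pixel watermark with no C2PA at all, so a metadata rule of this kind cannot rule the watermark out.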

9 Commits