remove-ai-watermarks

mirror of https://github.com/wiltodelta/remove-ai-watermarks.git synced 2026-06-05 10:38:00 +02:00

Author	SHA1	Message	Date
Victor Kuznetsov	223cbcf171	feat(metadata): detect China TC260 AIGC PNG chunk and HuggingFace hf-job-id aigc_label now reads the TC260 label from a raw-JSON `AIGC` PNG tEXt chunk (as Doubao/ByteDance write it, with no namespaced XMP marker) in addition to the `<TC260:AIGC>` XMP block, via a shared _parse helper gated on a TC260 field so a generic AIGC key cannot false-positive. New huggingface_job() reads the hf-job-id PNG chunk; identify surfaces it as a medium-confidence hf_job signal (parallel to the visible sparkle, never overriding a hard metadata verdict). Both wired into has_ai_metadata/get_ai_metadata; the PNG save whitelist already strips them on removal. Found by auditing 646 corpus originals: 28 AIGC and 3 hf-job files the library previously reported as Unknown. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 12:40:17 -07:00
test-user	18160fe269	feat(identify): integrity-clash detection for contradictory provenance (v0.6.7) Surface contradictions between independent provenance signals instead of collapsing to a single verdict -- a strong tell of spoofed, transplanted, or laundered metadata. Inspired by arXiv:2603.02378. Two rules in the new _integrity_clashes helper: - Conflicting AI-origin attributions: two or more distinct AI vendors named by independent generator stamps (e.g. a C2PA OpenAI manifest on an image whose EXIF says Make="Ideogram AI"). - Camera + AI: a camera-capture C2PA device (Pixel/Leica/Sony/Nikon/Truepic) coexisting with an AI-generation marker -- a genuine capture is not AI. High-precision by design: only hard generator stamps feed it (C2PA issuer when the source is AI, SynthID proxy, EXIF/XMP generator, IPTC AISystemUsed, xAI, AIGC). The fuzzy visible sparkle and the open invisible watermark are excluded -- the latter can be a by-product of our own SDXL removal pass. Vendor normalization (_vendor_of over _AI_VENDOR_TOKENS) keeps consistent signals from clashing (C2PA "Google (Gemini)" + SynthID-Google agree); the C2PA vendor is read from the issuer attribution, not the resolved platform, so a camera label like "Google Pixel" cannot mis-normalize to an AI vendor. Surfaced as ProvenanceReport.integrity_clashes (red in the table view, included in --json). 19 new tests; all real single-origin fixtures (chatgpt/firefly/doubao/grok/mj) verified to produce zero clashes (false-positive guard). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:27:25 -07:00
test-user	e1c99b5937	fix(identify): gate C2PA issuer->generator attribution on AI source type (v0.6.5) Prevents an unmapped C2PA device whose manifest incidentally contains a mapped issuer substring (e.g. the "Adobe XMP" toolkit string in a Canon/Sony camera capture) from being mislabeled as that AI generator ("Adobe Firefly"). _attribute_platform now names a specific AI-generator platform only when the digital-source-type is trainedAlgorithmicMedia; otherwise it degrades to the neutral "C2PA signer: X" label. Real Firefly/OpenAI/Google output carries the AI source-type and is unaffected (verified: chatgpt-1.png->OpenAI, firefly-1.png->Adobe Firefly still attribute). Closes the only real downside of leaving Canon/Samsung/Bria device signers unmapped: detection and removal were already unaffected; now the platform label degrades gracefully too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:29:12 -07:00
test-user	9f93d9c0c5	feat(identify): add Sony C2PA device attribution, verified (v0.6.3) Adds Sony to _DEVICE_C2PA_PLATFORM, matching Sony's own `sony.sig` / `sony.cert` C2PA assertion namespace (NOT bare "Sony", which is a common EXIF Make). Verified against a real Sony-signed file (Sony PXW-Z300, signer "Sony Corporation") found in the Security4Media/c2pa-video-player repo. The sample is video (MP4) -- our ISOBMFF C2PA path detects it; Sony Alpha stills likely share the namespace. Verified device set is now Leica, Nikon, Google Pixel, Sony, Truepic. Canon / Samsung / Bria still have no public direct-download C2PA sample to verify. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 21:13:49 -07:00
test-user	64be9598f2	fix(identify): device-token-first C2PA attribution; add verified Pixel (v0.6.2) Replaces the claim-generator-string match with a distinctive device-token scan of the manifest bytes (_device_platform / _DEVICE_C2PA_PLATFORM), which is more robust: it catches devices where the generator name lives under a non-standard CBOR key (Pixel uses `claim_generator_info`, so it has no `claim_generator`). - Adds Google Pixel, verified against a real Pixel 10 Pro C2PA file (attached to c2pa-rs issue #1609/#1554): cert CN "Pixel Camera", digitalSourceType `computationalCapture` -> capture authenticity, not AI (is_ai stays None). - Token distinctiveness is load-bearing: bare "Truepic" matched the OpenAI chatgpt-1.png fixture (Truepic is a trust-chain signing authority), so the token is the specific "Truepic_Lens"; "Pixel Camera" (cert CN) not "Pixel". - Verified Leica/Nikon/Truepic/Pixel attribute correctly and OpenAI/Adobe/MJ do not regress. Sony/Canon/Samsung/Bria stay unmapped: no public direct-download C2PA sample exists to verify their in-manifest string. - Regression tests: device token beats incidental issuer mentions (Leica, Pixel-vs-Google). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 20:43:40 -07:00
test-user	dda2ee7fbb	fix(identify): attribute C2PA by claim_generator, not incidental issuer tokens (v0.6.1) Verified on real signed files that the issuer byte-scan mis-attributes multi-entity manifests: Leica read as "Truepic" (timestamp authority in the chain), Nikon as "Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), Truepic as "Google". Platform attribution now prefers the claim generator (what produced the asset) and falls back to the issuer scan. - New _CLAIM_GENERATOR_PLATFORM map + _platform_from_generator; claim generator read for non-PNG via the now-public c2pa.cbor_text_after. - Device tokens listed only where verified against a real C2PA file (Leica lc_c2pa, Nikon, Truepic Lens); Pixel/Samsung/Sony/Canon/Bria deferred until a real sample confirms the in-manifest string. Camera C2PA marks capture authenticity, so these never set is_ai. - cbor_text_after made public (was _cbor_text_after); call sites + tests updated. - Regression test: claim_generator beats incidental Adobe/Google/Truepic tokens. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 20:10:07 -07:00
test-user	c196a16900	feat: detect soft-binding vendors, IPTC 2025.1, video/audio C2PA, TrustMark (v0.6.0) Broadens metadata provenance coverage at the detection and container-strip level. Detection: - C2PA soft-binding `alg` -> forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...) via C2PA_SOFT_BINDINGS + soft_binding_vendors_in(); names the watermark vendor even when the watermark itself can't be decoded. - IPTC Photo Metadata 2025.1 AI-disclosure XMP fields (AISystemUsed etc.) via iptc_ai_system() + IPTC_AI_FIELD_MARKERS. - Adobe TrustMark open keyless decoder (trustmark_detector.py, optional extra `trustmark`) -- the watermark behind Adobe Durable Content Credentials. Detects provenance, not AI origin, so it does not assert is_ai. Removal / containers: - isobmff.strip_c2pa_boxes now also drops a top-level XMP uuid box that carries an AI label (matched by AI-marker content, byte-order-robust; plain XMP kept). - remove_ai_metadata routes MP4/MOV/M4V/M4A (and any ftyp-sniffed ISOBMFF) through the box stripper; raises a clear error for non-ISOBMFF audio/video (WebM/MP3/WAV) instead of crashing in the image path. Tests: soft-binding scan, IPTC element/attribute/presence, MP4 + M4A detect/strip, ISOBMFF XMP surgical strip, content-sniff, unsupported-container guard, TrustMark absent-safety + identify integration. ruff clean; pyright clean on all new modules. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 17:56:48 -07:00
test-user	74618b91a7	feat: detect xAI/Grok EXIF signature; refresh watermarking landscape (v0.5.5) xAI Grok (Aurora) images carry no C2PA/SynthID/IPTC -- their only provenance signal is an EXIF pair: ImageDescription "Signature: <base64>" + a UUID Artist. Verified stable across 3 genuine generations (a real download previously read as unknown / "no AI metadata"). - metadata.xai_signature(): matches the Signature blob + UUID Artist pair; wired into has_ai_metadata, get_ai_metadata, and identify (platform "xAI (Grok / Aurora)"). - data/samples/grok-1.jpg: real Grok fixture (neutral content; the Artist UUID is the public image id, not PII). - Tests: synthetic-fixture unit tests, real-sample assertion, identify integration (322 passing). Docs (research refresh, May 2026): - C2PA 2.4 Durable Content Credentials (soft-binding re-discovery after the embedded manifest is stripped). - New AI-labeling laws, primary-source verified: EU AI Act Art 50 (2026-08-02), South Korea AI Framework Act Art 31(3), California AB 853. - Hedge removal claims: defeating the SynthID verifier is not forensic invisibility (arXiv:2605.09203); cite SynthID-Image (arXiv:2510.09263). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 14:14:35 -07:00
test-user	03fb460f77	Track the labeled SynthID corpus; complete metadata-source test coverage Corpus images were gitignored (local-only). The negatives were reviewed and cleared for publishing, so the labeled set is now committed (regular git, 65 MB across 25 files) -- making the removal regression set reproducible and CI-able. Corpus: - Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the synthetic refs/ calibration fills gitignored. - Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows for removed images pruned; dedup left one cleaned output, f6dd47a5). - Rewrite the corpus README layout/policy (images committed; review every image for private content before adding -- public repo, permanent history). Test fixtures: - Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF). - Add the clean_photo conftest fixture serving a verified-negative image from the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it (skips if the corpus is absent). Metadata-source coverage (close the last sub-variant gaps): - c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy). - exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP CreatorTool were already covered). All 8 metadata-source kinds are now tested at both the unit and identify() level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:46:47 -07:00
test-user	c7f0d71f90	feat(identify): detect China TC260 AIGC label (Doubao et al.) China-served generators embed an XMP <TC260:AIGC>{"Label":"1",...} block (China's mandatory AI-content labeling, TC260 standard). Doubao (ByteDance) uses it -- verified on the real #13 sample. It's none of C2PA / SynthID / imwatermark / IPTC, so identify() previously returned unknown. - metadata: AIGC_MARKERS + aigc_label() (json-decodes the HTML-entity-encoded block); has_ai_metadata + get_ai_metadata now surface it. - identify: new 'aigc' signal -> is_ai True, platform 'China AIGC-labeled generator (TC260; e.g. Doubao)', carries the ContentProducer code. - Container-agnostic raw-byte scan, so it covers the whole China-AIGC ecosystem (Jimeng/Kling/Qwen/Ernie share the standard). - Tests: synthetic TC260 block (metadata + identify). Docs updated. Addresses #13. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:29:51 -07:00
test-user	ad3b8ee248	feat(identify): read EXIF Software / XMP CreatorTool generator tags Closes the documented gap where EXIF/XMP fields inside AVIF/HEIF/JXL went unparsed. metadata.exif_generator extracts the EXIF Software/Artist tag (via PIL+piexif, which opens AVIF natively) and the XMP CreatorTool (via a container-agnostic raw-byte scan that also covers HEIF/JXL that PIL can't open), and matches against AI_GENERATOR_TOKENS so only generator names (Firefly, DALL-E, Midjourney, ComfyUI, ...) fire -- a plain 'Adobe Photoshop' or 'GIMP' tag is not flagged. identify() surfaces it as a high-confidence signal and uses it for platform attribution when no C2PA names a platform, so an AVIF/HEIF whose only AI signal is an EXIF/XMP generator tag is now caught. Validated with synthesized fixtures (the 'no positive fixtures' blocker was self-imposed): real AVIF and JPEG written with EXIF Software via PIL, plus an XMP CreatorTool raw-scan fixture. Zero false positives across the 109-image corpus (real iPhone photos carry no AI generator token). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 17:56:39 -07:00
test-user	3a1c5427c8	feat(c2pa): recognize Stability AI issuer; fix Microsoft platform label Collected live C2PA positives from Bing Image Creator and Stability Brand Studio (DreamStudio successor) and learned two things our scan got wrong: - Bing now runs Microsoft's own MAI-Image model, not DALL-E, and signs C2PA as 'Microsoft'. The scan caught it, but the platform label claimed 'Microsoft Designer (DALL-E / OpenAI backend)'. Relabeled model-neutral: 'Microsoft (Bing Image Creator / Designer)'. - Stability signs C2PA as 'Stability AI' (cert 'Stability AI Ltd'), which was not in C2PA_ISSUERS, so it read as 'unknown signer'. Added the issuer and a platform mapping. Stability uses no SynthID and (on its current Stable Image model) no imwatermark watermark -- verified, both negative. Both ingested as SynthID-negative corpus fixtures (they are AI but not SynthID) for issuer-coverage. Canva skipped: its downloads are re-encoded design exports that strip C2PA, so a Canva sample would be inconclusive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 17:12:42 -07:00
test-user	27ad5b7645	feat(identify): detect open SD/SDXL/FLUX invisible watermark Research found one locally-fillable detection gap: Stable Diffusion, SDXL, and FLUX all embed an open DWT-DCT watermark via the invisible-watermark (imwatermark) library -- a PUBLIC decoder, no secret key, unlike SynthID. New invisible_watermark.py decodes the known fixed patterns (verified against upstream source: diffusers SDXL WATERMARK_MESSAGE, FLUX.2 src/flux2/watermark.py, and the 'StableDiffusionV1' default string) and identify() reports the scheme as a high-confidence signal. Verified locally end-to-end: embedding SDXL's exact 48-bit message and decoding it back recovers 48/48 bits; a clean image and our own fal-SDXL outputs decode to ~21/48 (no match). Caveat baked into the report: the watermark is fragile -- gone after JPEG q90 -- so it confirms origin only on pristine files; absence is never proof. imwatermark is an optional dep (extra 'detect'; pulls non-headless opencv), so the import is guarded and the signal is skipped when absent. CLI --no-visible now means metadata-only (skips both pixel-domain detectors). Also records the broader watermarking landscape in CLAUDE.md: which services are locally detectable (SD/SDXL/FLUX), C2PA-covered (Bing/Canva/Getty/Shutterstock unsampled), or proprietary-only like SynthID (Amazon Titan/Nova, Kakao). Midjourney embeds neither C2PA nor an invisible mark. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 16:53:59 -07:00
test-user	1a9f3e4fe5	test(identify): cover provenance branches, CLI, sparkle helper Adds 20 tests around the new provenance path: - identify(): local SD/ComfyUI params -> local-pipeline attribution; visible-sparkle gating at the 0.5 threshold (mocked detector: above, below, unavailable, opt-out); metadata verdict not downgraded by a sparkle hit; OpenAI/SynthID caveats + dedup; ProvenanceReport is JSON-serializable (the CLI --json path); and the honest edge where a C2PA manifest without an AI source marker stays 'unknown'. - CLI 'identify': help, clean PNG, AI PNG platform, valid --json, missing file. - gemini_engine.detect_sparkle_confidence: float in range for a real image, None for an unreadable file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 16:27:00 -07:00
test-user	fa104bcade	feat(identify): provenance command (platform + watermark inventory) New 'identify' command and identify.py module: upload an image, get one ProvenanceReport answering where it was made and what watermarks it carries. Aggregates every locally-readable signal: - C2PA Content Credentials -> generating platform (issuer + generator). - IPTC digitalSourceType 'Made with AI' (Meta and others). - Embedded SD/ComfyUI generation parameters (local pipelines). - SynthID metadata proxy (Google / OpenAI C2PA companion). - Visible Gemini sparkle (cv2 fallback for the stripped-metadata case), promoted only at confidence >= 0.5 (corpus-tuned: Gemini sparkles score >= 0.56, non-sparkle <= 0.49). is_ai_generated is True or None, never asserted False -- stripped metadata leaves no local proof of a clean origin, so absence of signals is reported as 'unknown' with an explicit caveat. The SynthID pixel watermark remains locally undecodable; the report says so. Non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) get the same issuer + generator attribution via a binary scan (the caBX parser is PNG-only). The cv2 dependency is isolated in gemini_engine.detect_sparkle_confidence so identify.py stays type-clean. CLI supports --json and --no-visible. Validated against the 109-image corpus: 14/14 positives flagged AI, 93/94 negatives clean (the one 'neg' flagged is a Meta image that genuinely carries the IPTC tag -- correct), zero true errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 16:19:26 -07:00

15 Commits