remove-ai-watermarks

mirror of https://github.com/wiltodelta/remove-ai-watermarks.git synced 2026-07-05 16:07:49 +02:00

Author	SHA1	Message	Date
Victor Kuznetsov	9e307d020e	test(c2pa): add real FLUX.2 BFL C2PA fixtures (PNG + JPEG) flux-1.png / flux-1.jpg are real Black Forest Labs FLUX.2 [pro] Playground outputs (signed C2PA, issuer "Black Forest Labs" + trainedAlgorithmicMedia, manifests verified to contain no personal data). flux-1.jpg is the first committed JPEG-with-C2PA fixture, exercising the c2pa-python non-PNG reader path end to end. Regression tests assert both attribute to "Black Forest Labs (FLUX)". Also documents the verified finding (n=2, 2026-06-19): BFL's hosted output carries the signed C2PA manifest but NOT the open invisible-watermark DWT-DCT (decodes to degenerate all-ones, chance-level vs the FLUX reference) -- the open pixel mark is dev-inference-code-optional only. So a hosted FLUX.2 image is identified by C2PA alone, with no open-pixel fallback once C2PA is stripped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 09:37:40 -07:00
Victor Kuznetsov	d4d9429328	feat(identify): attribute Canva and BytePlus C2PA; fix BytePlus->Adobe mislabel Mining the local production corpus (25,725 imgs) surfaced two AI vendors signing C2PA that the registry missed: - Canva (Magic Media) signed "Canva" + trainedAlgorithmicMedia -> detected AI but no platform attributed (disproves the old "Canva exports strip C2PA" assumption). - BytePlus (ByteDance international: Seedream/Seededit) signs "Byteplus Pte. Ltd."; the bare volcengine needle missed it, so its output was mis-attributed to "Adobe Firefly" via an incidental "Adobe XMP" string the fallback byte-scan picked up. Adding both to C2PA_AI_VENDORS lets the clean manifest issuer attribute them directly. Corpus re-run: 16 platform changes, all improvements (3 Adobe->ByteDance fixes, 4 None/TC260->ByteDance, 9 None->Canva), 0 regressions. An attempted signer-based attribution fallback was measured and dropped: it regressed 18 images (friendly ByteDance label -> raw Chinese cert org; IPTC tool name pre-empted). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 21:57:21 -07:00
Victor Kuznetsov	61aa76a591	perf(identify): decode the image once for all visible-mark detectors identify(check_visible=True) ran the Gemini-sparkle detector and the Doubao/Jimeng text-mark detector each with its own image_io.imread, so the same bitmap was fully decoded twice. On a memory-constrained host (the raiw.cc 512 MB web worker, which runs identify on every upload) that doubled the peak decode allocation and contributed to OOM restarts. Decode once in identify() and pass the BGR array to both detectors. The detect methods already accept an NDArray, so this only threads the pre-decoded array through: detect_sparkle_confidence and the two _visible_* helpers gain an optional image= param that, when None, preserves the old self-read behavior (so direct callers and the cv2-missing/unreadable paths are unchanged). Only the visible path is deduplicated; the optional check_invisible decoders are unaffected (and off on the web hot path). Adds a test asserting identify(check_visible=True, check_invisible=False) decodes exactly once. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 11:13:17 -07:00
Victor Kuznetsov	ad7e4ee08b	feat(identify): close 3 detector gaps found on the spaces corpus (06-05..06-11) - AIGC: parse the bare ``AIGC{...}`` blob form (label glued to its JSON in a JPEG APP segment near the JFIF header), and scan both raw-JSON forms in one fall-through loop so a quoted ``"AIGC"`` later in an XMP packet no longer shadows a real bare label earlier in the file (3 files read unknown before). - Integrity clash rule 2: a camera device + an AI marker from the SAME C2PA manifest (Google Pixel Magic Editor / Pixel Studio edit chain) is a legitimate edit chain, not a contradiction. Fire only when the AI marker's source is independent of the camera's manifest; pure cameras (Leica/Sony/Nikon) are unaffected (2 Pixel files mis-flagged before). - New c2pa_cloud_manifest detector: surface a C2PA 2.4 Durable Content Credentials cloud-manifest reference (Adobe cai-manifests.adobe.com) as a medium provenance signal when the embedded manifest is stripped. Provenance only, never asserts is_ai (2 files read fully unknown before). identify reuses its already-loaded scan head for the cloud check (no second read). +7 tests; CLAUDE.md + README synced. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 09:28:15 -07:00
Victor Kuznetsov	df0fafe94e	fix(identify): stop flagging multi-actor C2PA manifests as integrity clashes The C2PA issuer attribution (`c2pa`) and the SynthID proxy (`synthid`) are derived from the same manifest, so treating them as independent signals made rule 1 fire on legitimate multi-actor manifests where a product wraps another vendor's engine (Microsoft Designer on OpenAI, Microsoft on Google) or an edit chain re-signs (Adobe over a Gemini original). 19 such files in the 2026-06-01/02 spaces batches read as "likely spoofed/laundered" before this. Group `c2pa` + `synthid` into one provenance source via `_CLASH_SOURCE`; rule 1 now requires two vendors from different sources. A manifest vendor still clashes with a genuinely independent stamp (EXIF/XMP generator, IPTC AISystemUsed, AIGC, xAI). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 19:02:35 -07:00
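The source-grouping rule from commit df0fafe94e can be sketched roughly as below. This is an illustration, not the repository's code: `CLASH_SOURCE`, `has_vendor_clash`, and the `(signal, vendor)` tuple layout are hypothetical stand-ins for `_CLASH_SOURCE` and rule 1, and the source labels are assumptions.

```python
# Hypothetical sketch: rule 1 fires only when two different vendors are
# claimed by two different provenance sources.
CLASH_SOURCE = {
    "c2pa": "manifest",     # issuer attribution read from the C2PA manifest
    "synthid": "manifest",  # SynthID proxy is derived from the same manifest
    "exif_generator": "exif",
    "iptc_ai_system": "iptc",
    "aigc": "aigc",
    "xai": "xai",
}


def has_vendor_clash(signals):
    """signals: iterable of (signal_name, normalized_vendor) pairs."""
    vendors_by_source = {}
    for name, vendor in signals:
        source = CLASH_SOURCE.get(name, name)
        vendors_by_source.setdefault(source, set()).add(vendor)

    sources = list(vendors_by_source)
    # Compare every pair of distinct sources; vendors inside one source
    # (for example a multi-actor manifest) never clash with each other.
    for i, first in enumerate(sources):
        for second in sources[i + 1:]:
            for vendor_a in vendors_by_source[first]:
                for vendor_b in vendors_by_source[second]:
                    if vendor_a != vendor_b:
                        return True
    return False
```

Under these assumptions a Microsoft-on-OpenAI manifest yields one source with two vendors and no clash, while an OpenAI manifest next to an EXIF stamp naming another vendor still clashes.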
Victor Kuznetsov	e501bec9ff	feat(identify): detect visible Doubao/Jimeng marks; keep identify import torch-free identify previously ran only the Gemini sparkle as a visible detector, so a Doubao/Jimeng image with stripped TC260 metadata had no visible fallback. Add `_visible_text_marks` (registry-backed) so the ByteDance Doubao 豆包AI生成 and Jimeng 即梦AI marks are detected too, each gated by its own engine NCC threshold via MarkDetection.detected. New signals `visible_doubao` / `visible_jimeng` (medium), same stripped-metadata fallback role as the sparkle; excluded from integrity-clash vendor claims; set platform only when no harder signal did. Also make `noai/__init__` lazy (PEP 562 __getattr__): importing the light `noai.c2pa` / `noai.constants` submodules (which identify needs) no longer eagerly pulls `watermark_remover`, which imports torch + diffusers at module top. `import remove_ai_watermarks.identify` drops from ~420 MB to ~21 MB in a full gpu/detect install (torch not loaded), so it fits a 512 MB host; the removal API resolves lazily on first access. Guarded by TestIdentifyImportIsLight. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 20:43:52 -07:00
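The lazy `noai/__init__` described in commit e501bec9ff relies on PEP 562 module-level `__getattr__`. A minimal, self-contained sketch of that mechanism follows; `make_lazy_package` and the demo mapping are hypothetical, and the stdlib `colorsys` module merely stands in for a heavy submodule such as `watermark_remover`.

```python
import importlib
import types


def make_lazy_package(name, lazy_map):
    """Build a module whose attributes are imported on first access.

    lazy_map maps a public attribute name to the module that defines it.
    """
    package = types.ModuleType(name)

    def __getattr__(attr):
        try:
            target = lazy_map[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(target), attr)
        setattr(package, attr, value)  # cache: later lookups skip __getattr__
        return value

    # PEP 562: a __getattr__ found in the module namespace is called only
    # when normal attribute lookup fails.
    package.__getattr__ = __getattr__
    return package


demo = make_lazy_package("demo", {"rgb_to_hsv": "colorsys"})
```

In a real package the same `__getattr__` would be written directly at the top level of `__init__.py`; the factory here only keeps the example runnable in one file.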
Victor Kuznetsov	e572767555	feat(visible): add Jimeng remover, fix Doubao outline defect, reproducible mask build Visible-watermark work across all three corner-mark engines plus a committed, reproducible alpha-build pipeline (scripts/visible_alpha_solve.py) fed by committed solid black/gray/white captures. - jimeng: new "即梦AI" wordmark remover (reverse-alpha + thin residual inpaint, always NCC-aligned -- the mark re-rasterizes/jitters per image). Detect via glyph silhouette NCC (0.45 threshold; does not cross-fire with Doubao). Registered in the visible-mark catalog; `visible --mark jimeng` / `--mark auto`. - doubao: fix a real production defect -- the shipped remover left a READABLE "豆包AI生成" outline on real samples while detect() returned conf 0.0 (fooled by a thin outline), so the test passed and the "56/56 clean" claim was detector-measured, not visual. Root cause: under-estimated alpha + fixed-geometry-no-inpaint + tight locate box. Rebuilt alpha (careful gray-self solve), always-align, thin inpaint, widened locate box -> readable outline becomes faint texture-level traces. - gemini: rebuild gemini_bg_{96,48} from our own controlled captures (validated NCC 0.9998 vs the prior third-party asset); removal re-verified clean, no behaviour change. - tests: add textured-shift regression to both engines (guards the align-on-shift path the Doubao defect exposed; lesson: a detector-only removal test is insufficient, assert visual residual). - docs: CLAUDE.md, README, capture READMEs and docstrings synced; stale "exact/pixel-exact/56-clean" claims removed. Also includes a SynthID label-wording clarification in identify.py/cli.py ("SynthID pixel watermark" -> "SynthID watermark, inferred from C2PA metadata"). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 12:20:19 -07:00
Victor Kuznetsov	5d0e6c3a65	fix: harden metadata parsers and engines; sync docs (full-repo review) Apply fixes from a full-repo review (code, tests, docs). Security / correctness: - Clamp attacker-controlled PNG/caBX chunk lengths to the remaining file size in metadata.py and noai/c2pa.py (a malformed length no longer drives a multi-GB read); skipped chunks seek instead of read. - noai/isobmff.strip_c2pa_boxes is now fail-safe on a malformed box: return the original bytes with a warning instead of silently truncating the tail, so metadata --remove can no longer emit a corrupt file. - doubao_engine._fixed_alpha_map clamps the glyph box to the image (no crash on degenerate width-vs-height). - watermark_remover._run_region_hires gates the phaseCorrelate offset on response and magnitude (a spurious shift no longer garbles text) and drops the generator after a CPU fallback (no MPS/CPU device mismatch). Robustness: - gemini_engine, doubao_engine, region_eraser normalize grayscale and RGBA inputs to BGR at the engine entry points. - image_io.imwrite returns False on an unwritable path (matches cv2). - invisible_engine guards a None imread result before use. - trustmark_detector._decoder uses a double-checked threading lock. - ctrlregen.tiling.tile_positions raises on overlap >= tile. - humanizer chromatic shift no longer wraps opposite-edge pixels. - identify OpenAI caveat keyed on the normalized vendor, not a substring. - Remove the dead "visible --detect-threshold" CLI option. - publish.yml verifies the release tag matches the package version. Docs: - README strength 0.05 to 0.10; .env.example HF_TOKEN marked optional; doubao_capture README updated to reverse-alpha-only; CLAUDE.md synced with the new behaviors and the batch command. 
Tests: new test_security_clamp.py for the read clamp and isobmff fail-safe; erase CLI coverage; integrity-clash rule 2 end-to-end; multi-tag EXIF survival and cross-format strip guards; channel/size, tiling, humanizer, and imwrite regressions. Full suite 493 passed, 2 skipped; ruff and pyright src/ clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 18:00:39 -07:00
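The chunk-length clamp from commit 5d0e6c3a65 can be illustrated with a small PNG chunk walker. This is a sketch under stated assumptions, not the code in `metadata.py`: `iter_png_chunks` and its `wanted` default are hypothetical, and the CRC is skipped rather than verified.

```python
import io
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_png_chunks(stream, file_size, wanted=(b"tEXt", b"iTXt", b"caBX")):
    """Yield (type, data) for wanted chunks without reading past the file end."""
    if stream.read(8) != PNG_SIGNATURE:
        return
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        declared, chunk_type = struct.unpack(">I4s", header)
        remaining = file_size - stream.tell()
        if remaining <= 0:
            return
        # Clamp the attacker-controlled length to what the file can hold.
        length = min(declared, remaining)
        if chunk_type in wanted:
            yield chunk_type, stream.read(length)
        else:
            stream.seek(length, io.SEEK_CUR)  # skipped chunks seek, not read
        if declared > remaining:
            return  # malformed or truncated chunk: stop walking
        stream.seek(4, io.SEEK_CUR)  # skip the CRC
        if chunk_type == b"IEND":
            return
```

A chunk that declares 0xFFFFFFFF bytes but is followed by only five bytes yields those five bytes and ends the walk, instead of requesting a multi-gigabyte read.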
Victor Kuznetsov	58bdf51c59	Visible-watermark registry: reverse-alpha-only Doubao + Gemini, exact native recovery (#28) * fix(trustmark): gate detection on re-encode durability to kill false positives TrustMark's wm_present flag is a BCH validity check that spuriously validates on a content-correlated fraction of un-watermarked images (AI textures trip it more than camera photos). On a 1343-image set all 20 raw detections were false, several on Gemini/OpenAI/Doubao output that cannot carry Adobe's watermark, with random-bytes secrets. A genuine TrustMark is a durable soft binding that survives re-encoding, so detect_trustmark now re-decodes after a mild JPEG round-trip and requires the same schema both times. Every observed false positive collapsed under this gate; the second decode runs only on the rare hit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(identify): Samsung Galaxy AI, FLUX, ByteDance C2PA; fix C2PA substring FP Detection extensions verified on real signed files (2026-05-29): - Samsung Galaxy AI: signer attribution via a new _SIGNER_C2PA_PLATFORM (Samsung Galaxy / ASUS Gallery) kept separate from the capture-camera _DEVICE_C2PA_PLATFORM so a Galaxy AI edit (device cert + AI source type) does not trip the camera-vs-AI integrity clash. Plus metadata.samsung_genai: the proprietary genAIType marker in PhotoEditor_Re_Edit_Data, a medium-confidence AI-editing signal (samsung_only branch). - Black Forest Labs (FLUX) and ByteDance Volcano Engine (Doubao/Jimeng) added as C2PA issuers + issuer->platform mappings. - fix: C2PA presence required only the bare 4-byte 'c2pa' substring, which false-positives on compressed pixel data (a recompressed PNG IDAT re-flagged C2PA after its manifest was correctly stripped). New c2pa_marker_in() requires the JUMBF wrapper (jumb+c2pa) or the C2PA uuid box; applied in identify + metadata. Verified: all 535 real C2PA files carry jumb.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(doubao): gate detection on text structure to cut ~95% of false positives (#23) Coverage alone over-fired: any textured bottom-right corner cleared the threshold, so the detector false-positived on ~28% of arbitrary images. The real '豆包AI生成' mark is six glyphs in one row, so detect now also requires the text-structure signature (_glyph_structure): many connected components, no single dominant blob, concentration in a thin horizontal band. False positives dropped 343 -> 17 across the corpus while keeping real-mark recall and the doubao-1.png sample. Also accept a no-op force kwarg for remover-interface symmetry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(samsung): add Samsung Galaxy AI visible-badge remover New samsung_engine.py removes the bottom-left sparkle + localized 'AI-generated content' badge that Galaxy AI tools stamp. Mirrors the Doubao locate->mask->inpaint pattern but bottom-left, with a dual-polarity top-hat mask (the badge is light-on-dark or dark-on-light). Detection gates on a band + left-anchor signature (the Doubao CJK-component gate does not transfer: Latin badge letters connect into few blobs). Explicit-only -- tuned on few real badges with a ~4% FP floor, so it is not used in auto. Synthetic byte-blob fixtures (real badges are user content, not shipped). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(visible): unified known-watermark registry + LaMa inpaint backend watermark_registry.py is a single catalog of known visible marks, each tying {usual location, in_auto flag, recovery strategy, detect adapter, remove adapter}: gemini (reverse-alpha, exact), doubao, samsung. cmd_visible is now registry-driven (best_auto_mark for --mark auto; mark_keys() feeds the CLI choices) -- the per-mark _run_doubao/_run_samsung helper branches are gone. 
Cross-engine confidences are not comparable, so the gemini adapter applies the corpus-validated 0.5 sparkle threshold for auto arbitration (its engine flag is loose and weakly fired ~0.36 on Doubao text, hijacking auto). --backend auto|cv2|lama chooses background reconstruction for the mask-based marks; auto = LaMa when onnxruntime is present, else cv2. For LaMa the mask is the FILLED glyph bounding box (sparse glyph masks leave anti-aliased edges behind). cv2 stays the zero-dependency fallback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: watermark registry, Samsung/FLUX/ByteDance detection, LaMa backend, trustmark gate Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(doubao): exact reverse-alpha removal from captured alpha map The Doubao '豆包AI生成' mark is a fixed semi-transparent white overlay, so given its alpha map the original pixels are recovered exactly: original = (wm - a*logo)/(1-a) -- no inpaint hallucination. The alpha map + logo colour were solved from real black+gray Doubao captures on a controlled background: on black captured = a*logo, and the black/gray pair solves a per-pixel without assuming the logo colour (a_max~0.65, logo near-white); the white capture cross-validates (mark vanishes to a flat fill). Bundled as assets/doubao_alpha.png + geometry constants. remove_watermark_reverse_alpha applies it scaled to image width; exact at the captured width, so the registry routes doubao through it only when reverse_alpha_available (width within the calibrated band) and the mark is detected, falling back to mask inpaint (cv2/LaMa) otherwise. A light residual inpaint cleans the sub-pixel rescaling error. Add captures at more resolutions to widen exact coverage. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(visible): reverse-alpha only -- drop inpaint removal + heuristic detection Per the principle that we only remove/detect what we can do exactly, the visible-mark path is now reverse-alpha only: - Doubao detect is reverse-alpha-consistent: match the bundled alpha glyph silhouette against the corner via TM_CCOEFF_NORMED (DETECT_NCC_THRESHOLD 0.4) -- keys on the '豆包AI生成' SHAPE, not coverage/structure heuristics. FP 7/1243 (0.6%). Removes the cv2 inpaint path + the _glyph_structure gate. - Registry is reverse-alpha only: dropped the cv2/LaMa backend (_glyph_remove, _lama_box_inpaint, default_backend, --backend) and the Samsung entry. Doubao outside the alpha resolution band is skipped, never inpainted. - Removed samsung_engine.py + tests + --mark samsung (no alpha map captured; Samsung C2PA/genAIType metadata detection in identify is unaffected). - The universal erase --region (cv2/LaMa) is unchanged -- arbitrary-region inpainting stays a user-directed tool, separate from the known-mark registry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(doubao): NCC sub-pixel alignment -> reverse-alpha at any resolution A pure width-scale of the captured alpha map is only sub-pixel-accurate at the captured width and leaves a faint ghost elsewhere. remove_watermark_reverse_alpha now registers the alpha glyph to the actual mark via a TM_CCOEFF_NORMED scale+position search (_aligned_alpha_map) before inverting the blend, so the single 2048 capture works at any resolution -- verified clean on the 1773x2364 (3:4) corpus size, the biggest coverage gap (23 files). reverse_alpha_available is now just 'asset present' (no width band); the registry still gates removal on detect so a clean corner is never touched. Drops the _ALPHA_WIDTH_TOLERANCE gate. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(doubao): keep native recovery exact -- fixed geometry at captured width Integer-pixel NCC alignment landed ~1px off at the captured width, degrading the otherwise-exact native reverse-alpha (synthetic recovery error 0.94 -> 1.39). remove_watermark_reverse_alpha now uses exact width-relative geometry within _ALPHA_NATIVE_BAND of the captured width and the NCC search only off it -- best of both: native back to 0.94, other resolutions still aligned. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(doubao): harden alignment -- try fixed+aligned, keep least residual (56/56) On a faint/busy-background mark the NCC alignment peak can wander a few px off the true mark and leave a residual (2/56 real corpus files). Off the captured width, remove_watermark_reverse_alpha now builds BOTH the fixed-geometry and the NCC-aligned alpha map, applies each, and keeps whichever leaves the least residual mark (re-detect confidence on the bare reverse-alpha) -- geometry wins on faint marks, alignment on clear ones, no magic threshold. Real-file round-trip now removes 56/56 detected Doubao clean across every corpus resolution (was 54). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * perf(doubao): skip residual inpaint at native width for exact recovery At the captured width the fixed-geometry reverse-alpha is pixel-exact, so inpainting over it only replaced exactly-recovered interior pixels with a cv2 hallucination -- measured worse on a textured background (native error vs true bg 1.6 reverse-alpha-only vs 2.6 with the old always-on full-footprint inpaint). Native now returns the bare recovery untouched; off-native, where NCC alignment is only sub-pixel-approximate, the footprint inpaint stays to clean the seam. Real round-trip still 56/56 across all corpus resolutions; negatives 0/60, Gemini unaffected. 
Add test_native_returns_exact_reverse_alpha_no_inpaint as the regression guard. Sync CLAUDE.md + README (the table cell and prose described the pre-NCC "skipped off native / cv2-LaMa" behavior, now stale). Gitignore the session scheduled_tasks.lock, and add the text-protection research note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 19:49:09 -07:00
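The reverse-alpha arithmetic described in the Doubao commits above can be written out per pixel. The sketch below assumes the usual blend model wm = a*logo + (1-a)*original on 0..255 values; the function names are hypothetical and this is not the repository's `remove_watermark_reverse_alpha`, which also handles scaling and alignment.

```python
def blend(original, alpha, logo):
    """Forward model: a semi-transparent logo composited over a pixel."""
    return alpha * logo + (1.0 - alpha) * original


def solve_alpha(on_black, on_gray, gray_level):
    """Per-pixel alpha from the same mark captured over black and flat gray.

    on_black = a*logo and on_gray = a*logo + (1-a)*gray_level, so the
    difference isolates (1-a) without assuming the logo colour.
    """
    return 1.0 - (on_gray - on_black) / gray_level


def reverse_alpha(watermarked, alpha, alpha_logo):
    """Invert the blend: original = (wm - a*logo) / (1 - a)."""
    if alpha >= 1.0:
        raise ValueError("a fully opaque mark cannot be inverted")
    value = (watermarked - alpha_logo) / (1.0 - alpha)
    return min(255.0, max(0.0, value))  # keep the result in pixel range


# Illustrative values only: one pixel of a near-white mark at alpha 0.5.
on_black = blend(0.0, 0.5, 255.0)    # capture over black gives a*logo
on_gray = blend(128.0, 0.5, 255.0)   # capture over flat gray
alpha = solve_alpha(on_black, on_gray, 128.0)
recovered = reverse_alpha(blend(100.0, 0.5, 255.0), alpha, on_black)
```

The black capture directly supplies the a*logo term, which is why the inversion needs only the alpha map and that capture.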
Victor Kuznetsov	223cbcf171	feat(metadata): detect China TC260 AIGC PNG chunk and HuggingFace hf-job-id aigc_label now reads the TC260 label from a raw-JSON `AIGC` PNG tEXt chunk (as Doubao/ByteDance write it, with no namespaced XMP marker) in addition to the `<TC260:AIGC>` XMP block, via a shared _parse helper gated on a TC260 field so a generic AIGC key cannot false-positive. New huggingface_job() reads the hf-job-id PNG chunk; identify surfaces it as a medium-confidence hf_job signal (parallel to the visible sparkle, never overriding a hard metadata verdict). Both wired into has_ai_metadata/get_ai_metadata; the PNG save whitelist already strips them on removal. Found by auditing 646 corpus originals: 28 AIGC and 3 hf-job files the library previously reported as Unknown. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 12:40:17 -07:00
test-user	18160fe269	feat(identify): integrity-clash detection for contradictory provenance (v0.6.7) Surface contradictions between independent provenance signals instead of collapsing to a single verdict -- a strong tell of spoofed, transplanted, or laundered metadata. Inspired by arXiv:2603.02378. Two rules in the new _integrity_clashes helper: - Conflicting AI-origin attributions: two or more distinct AI vendors named by independent generator stamps (e.g. a C2PA OpenAI manifest on an image whose EXIF says Make="Ideogram AI"). - Camera + AI: a camera-capture C2PA device (Pixel/Leica/Sony/Nikon/Truepic) coexisting with an AI-generation marker -- a genuine capture is not AI. High-precision by design: only hard generator stamps feed it (C2PA issuer when the source is AI, SynthID proxy, EXIF/XMP generator, IPTC AISystemUsed, xAI, AIGC). The fuzzy visible sparkle and the open invisible watermark are excluded -- the latter can be a by-product of our own SDXL removal pass. Vendor normalization (_vendor_of over _AI_VENDOR_TOKENS) keeps consistent signals from clashing (C2PA "Google (Gemini)" + SynthID-Google agree); the C2PA vendor is read from the issuer attribution, not the resolved platform, so a camera label like "Google Pixel" cannot mis-normalize to an AI vendor. Surfaced as ProvenanceReport.integrity_clashes (red in the table view, included in --json). 19 new tests; all real single-origin fixtures (chatgpt/firefly/doubao/grok/mj) verified to produce zero clashes (false-positive guard). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:27:25 -07:00
test-user	e1c99b5937	fix(identify): gate C2PA issuer->generator attribution on AI source type (v0.6.5) Prevents an unmapped C2PA device whose manifest incidentally contains a mapped issuer substring (e.g. the "Adobe XMP" toolkit string in a Canon/Sony camera capture) from being mislabeled as that AI generator ("Adobe Firefly"). _attribute_platform now names a specific AI-generator platform only when the digital-source-type is trainedAlgorithmicMedia; otherwise it degrades to the neutral "C2PA signer: X" label. Real Firefly/OpenAI/Google output carries the AI source-type and is unaffected (verified: chatgpt-1.png->OpenAI, firefly-1.png->Adobe Firefly still attribute). Closes the only real downside of leaving Canon/Samsung/Bria device signers unmapped: detection and removal were already unaffected; now the platform label degrades gracefully too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 10:29:12 -07:00
test-user	9f93d9c0c5	feat(identify): add Sony C2PA device attribution, verified (v0.6.3) Adds Sony to _DEVICE_C2PA_PLATFORM, matching Sony's own `sony.sig` / `sony.cert` C2PA assertion namespace (NOT bare "Sony", which is a common EXIF Make). Verified against a real Sony-signed file (Sony PXW-Z300, signer "Sony Corporation") found in the Security4Media/c2pa-video-player repo. The sample is video (MP4) -- our ISOBMFF C2PA path detects it; Sony Alpha stills likely share the namespace. Verified device set is now Leica, Nikon, Google Pixel, Sony, Truepic. Canon / Samsung / Bria still have no public direct-download C2PA sample to verify. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 21:13:49 -07:00
test-user	64be9598f2	fix(identify): device-token-first C2PA attribution; add verified Pixel (v0.6.2) Replaces the claim-generator-string match with a distinctive device-token scan of the manifest bytes (_device_platform / _DEVICE_C2PA_PLATFORM), which is more robust: it catches devices where the generator name lives under a non-standard CBOR key (Pixel uses `claim_generator_info`, so it has no `claim_generator`). - Adds Google Pixel, verified against a real Pixel 10 Pro C2PA file (attached to c2pa-rs issue #1609/#1554): cert CN "Pixel Camera", digitalSourceType `computationalCapture` -> capture authenticity, not AI (is_ai stays None). - Token distinctiveness is load-bearing: bare "Truepic" matched the OpenAI chatgpt-1.png fixture (Truepic is a trust-chain signing authority), so the token is the specific "Truepic_Lens"; "Pixel Camera" (cert CN) not "Pixel". - Verified Leica/Nikon/Truepic/Pixel attribute correctly and OpenAI/Adobe/MJ do not regress. Sony/Canon/Samsung/Bria stay unmapped: no public direct-download C2PA sample exists to verify their in-manifest string. - Regression tests: device token beats incidental issuer mentions (Leica, Pixel-vs-Google). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 20:43:40 -07:00
test-user	dda2ee7fbb	fix(identify): attribute C2PA by claim_generator, not incidental issuer tokens (v0.6.1) Verified on real signed files that the issuer byte-scan mis-attributes multi-entity manifests: Leica read as "Truepic" (timestamp authority in the chain), Nikon as "Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), Truepic as "Google". Platform attribution now prefers the claim generator (what produced the asset) and falls back to the issuer scan. - New _CLAIM_GENERATOR_PLATFORM map + _platform_from_generator; claim generator read for non-PNG via the now-public c2pa.cbor_text_after. - Device tokens listed only where verified against a real C2PA file (Leica lc_c2pa, Nikon, Truepic Lens); Pixel/Samsung/Sony/Canon/Bria deferred until a real sample confirms the in-manifest string. Camera C2PA marks capture authenticity, so these never set is_ai. - cbor_text_after made public (was _cbor_text_after); call sites + tests updated. - Regression test: claim_generator beats incidental Adobe/Google/Truepic tokens. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 20:10:07 -07:00
test-user	c196a16900	feat: detect soft-binding vendors, IPTC 2025.1, video/audio C2PA, TrustMark (v0.6.0) Broadens metadata provenance coverage at the detection and container-strip level. Detection: - C2PA soft-binding `alg` -> forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...) via C2PA_SOFT_BINDINGS + soft_binding_vendors_in(); names the watermark vendor even when the watermark itself can't be decoded. - IPTC Photo Metadata 2025.1 AI-disclosure XMP fields (AISystemUsed etc.) via iptc_ai_system() + IPTC_AI_FIELD_MARKERS. - Adobe TrustMark open keyless decoder (trustmark_detector.py, optional extra `trustmark`) -- the watermark behind Adobe Durable Content Credentials. Detects provenance, not AI origin, so it does not assert is_ai. Removal / containers: - isobmff.strip_c2pa_boxes now also drops a top-level XMP uuid box that carries an AI label (matched by AI-marker content, byte-order-robust; plain XMP kept). - remove_ai_metadata routes MP4/MOV/M4V/M4A (and any ftyp-sniffed ISOBMFF) through the box stripper; raises a clear error for non-ISOBMFF audio/video (WebM/MP3/WAV) instead of crashing in the image path. Tests: soft-binding scan, IPTC element/attribute/presence, MP4 + M4A detect/strip, ISOBMFF XMP surgical strip, content-sniff, unsupported-container guard, TrustMark absent-safety + identify integration. ruff clean; pyright clean on all new modules. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 17:56:48 -07:00
test-user	74618b91a7	feat: detect xAI/Grok EXIF signature; refresh watermarking landscape (v0.5.5) xAI Grok (Aurora) images carry no C2PA/SynthID/IPTC -- their only provenance signal is an EXIF pair: ImageDescription "Signature: <base64>" + a UUID Artist. Verified stable across 3 genuine generations (a real download previously read as unknown / "no AI metadata"). - metadata.xai_signature(): matches the Signature blob + UUID Artist pair; wired into has_ai_metadata, get_ai_metadata, and identify (platform "xAI (Grok / Aurora)"). - data/samples/grok-1.jpg: real Grok fixture (neutral content; the Artist UUID is the public image id, not PII). - Tests: synthetic-fixture unit tests, real-sample assertion, identify integration (322 passing). Docs (research refresh, May 2026): - C2PA 2.4 Durable Content Credentials (soft-binding re-discovery after the embedded manifest is stripped). - New AI-labeling laws, primary-source verified: EU AI Act Art 50 (2026-08-02), South Korea AI Framework Act Art 31(3), California AB 853. - Hedge removal claims: defeating the SynthID verifier is not forensic invisibility (arXiv:2605.09203); cite SynthID-Image (arXiv:2510.09263). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 14:14:35 -07:00
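The EXIF pair check described in commit 74618b91a7 might look roughly like the sketch below. It is an assumption-laden illustration rather than `metadata.xai_signature()`: `looks_like_xai_pair` is a hypothetical name, the accepted base64 alphabet is a guess, and the function takes already-extracted tag strings instead of reading EXIF itself.

```python
import re
import uuid

# "Signature: <base64>" with standard or URL-safe base64 characters (assumed).
_SIGNATURE_RE = re.compile(r"^Signature:\s*([A-Za-z0-9+/=_-]+)$")


def looks_like_xai_pair(image_description, artist):
    """True when ImageDescription is a Signature blob and Artist parses as a UUID."""
    if not image_description or not artist:
        return False
    if _SIGNATURE_RE.match(image_description.strip()) is None:
        return False
    try:
        uuid.UUID(artist.strip())
    except ValueError:
        return False
    return True
```

Requiring both fields together keeps an ordinary photo with a free-text description or a human artist name from matching.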
test-user	03fb460f77	Track the labeled SynthID corpus; complete metadata-source test coverage Corpus images were gitignored (local-only). The negatives were reviewed and cleared for publishing, so the labeled set is now committed (regular git, 65 MB across 25 files) -- making the removal regression set reproducible and CI-able. Corpus: - Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the synthetic refs/ calibration fills gitignored. - Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows for removed images pruned; dedup left one cleaned output, f6dd47a5). - Rewrite the corpus README layout/policy (images committed; review every image for private content before adding -- public repo, permanent history). Test fixtures: - Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF). - Add the clean_photo conftest fixture serving a verified-negative image from the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it (skips if the corpus is absent). Metadata-source coverage (close the last sub-variant gaps): - c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy). - exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP CreatorTool were already covered). All 8 metadata-source kinds are now tested at both the unit and identify() level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:46:47 -07:00
test-user	c7f0d71f90	feat(identify): detect China TC260 AIGC label (Doubao et al.) China-served generators embed an XMP <TC260:AIGC>{"Label":"1",...} block (China's mandatory AI-content labeling, TC260 standard). Doubao (ByteDance) uses it -- verified on the real #13 sample. It's none of C2PA / SynthID / imwatermark / IPTC, so identify() previously returned unknown. - metadata: AIGC_MARKERS + aigc_label() (json-decodes the HTML-entity-encoded block); has_ai_metadata + get_ai_metadata now surface it. - identify: new 'aigc' signal -> is_ai True, platform 'China AIGC-labeled generator (TC260; e.g. Doubao)', carries the ContentProducer code. - Container-agnostic raw-byte scan, so it covers the whole China-AIGC ecosystem (Jimeng/Kling/Qwen/Ernie share the standard). - Tests: synthetic TC260 block (metadata + identify). Docs updated. Addresses #13. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:29:51 -07:00
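The TC260 parsing step in commit c7f0d71f90 (a raw-byte scan followed by JSON-decoding an HTML-entity-encoded block) can be sketched as follows. `parse_aigc_label` is a hypothetical name, the closing `</TC260:AIGC>` tag is an assumption about how the XMP element is terminated, and gating on the `Label` key is only one plausible reading of the commit.

```python
import html
import json
import re

# Container-agnostic: search the raw file bytes, not a parsed XMP tree.
_AIGC_BLOCK_RE = re.compile(rb"<TC260:AIGC>(.*?)</TC260:AIGC>", re.DOTALL)


def parse_aigc_label(raw):
    """Return the decoded AIGC payload dict, or None when no valid block exists."""
    match = _AIGC_BLOCK_RE.search(raw)
    if match is None:
        return None
    # The JSON is stored with HTML entities (&quot; etc.), so unescape first.
    text = html.unescape(match.group(1).decode("utf-8", errors="replace"))
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or "Label" not in payload:
        return None
    return payload
```

A caller could then read fields such as `ContentProducer` from the returned dict, as the commit message describes.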
test-user	ad3b8ee248	feat(identify): read EXIF Software / XMP CreatorTool generator tags Closes the documented gap where EXIF/XMP fields inside AVIF/HEIF/JXL went unparsed. metadata.exif_generator extracts the EXIF Software/Artist tag (via PIL+piexif, which opens AVIF natively) and the XMP CreatorTool (via a container-agnostic raw-byte scan that also covers HEIF/JXL that PIL can't open), and matches against AI_GENERATOR_TOKENS so only generator names (Firefly, DALL-E, Midjourney, ComfyUI, ...) fire -- a plain 'Adobe Photoshop' or 'GIMP' tag is not flagged. identify() surfaces it as a high-confidence signal and uses it for platform attribution when no C2PA names a platform, so an AVIF/HEIF whose only AI signal is an EXIF/XMP generator tag is now caught. Validated with synthesized fixtures (the 'no positive fixtures' blocker was self-imposed): real AVIF and JPEG written with EXIF Software via PIL, plus an XMP CreatorTool raw-scan fixture. Zero false positives across the 109-image corpus (real iPhone photos carry no AI generator token). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 17:56:39 -07:00
test-user	3a1c5427c8	feat(c2pa): recognize Stability AI issuer; fix Microsoft platform label Collected live C2PA positives from Bing Image Creator and Stability Brand Studio (DreamStudio successor) and learned two things our scan got wrong: - Bing now runs Microsoft's own MAI-Image model, not DALL-E, and signs C2PA as 'Microsoft'. The scan caught it, but the platform label claimed 'Microsoft Designer (DALL-E / OpenAI backend)'. Relabeled model-neutral: 'Microsoft (Bing Image Creator / Designer)'. - Stability signs C2PA as 'Stability AI' (cert 'Stability AI Ltd'), which was not in C2PA_ISSUERS, so it read as 'unknown signer'. Added the issuer and a platform mapping. Stability uses no SynthID and (on its current Stable Image model) no imwatermark watermark -- verified, both negative. Both ingested as SynthID-negative corpus fixtures (they are AI but not SynthID) for issuer-coverage. Canva skipped: its downloads are re-encoded design exports that strip C2PA, so a Canva sample would be inconclusive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 17:12:42 -07:00
test-user	27ad5b7645	feat(identify): detect open SD/SDXL/FLUX invisible watermark Research found one locally-fillable detection gap: Stable Diffusion, SDXL, and FLUX all embed an open DWT-DCT watermark via the invisible-watermark (imwatermark) library -- a PUBLIC decoder, no secret key, unlike SynthID. New invisible_watermark.py decodes the known fixed patterns (verified against upstream source: diffusers SDXL WATERMARK_MESSAGE, FLUX.2 src/flux2/watermark.py, and the 'StableDiffusionV1' default string) and identify() reports the scheme as a high-confidence signal. Verified locally end-to-end: embedding SDXL's exact 48-bit message and decoding it back recovers 48/48 bits; a clean image and our own fal-SDXL outputs decode to ~21/48 (no match). Caveat baked into the report: the watermark is fragile -- gone after JPEG q90 -- so it confirms origin only on pristine files; absence is never proof. imwatermark is an optional dep (extra 'detect'; pulls non-headless opencv), so the import is guarded and the signal is skipped when absent. CLI --no-visible now means metadata-only (skips both pixel-domain detectors). Also records the broader watermarking landscape in CLAUDE.md: which services are locally detectable (SD/SDXL/FLUX), C2PA-covered (Bing/Canva/ Getty/Shutterstock unsampled), or proprietary-only like SynthID (Amazon Titan/Nova, Kakao). Midjourney embeds neither C2PA nor an invisible mark. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 16:53:59 -07:00
test-user	1a9f3e4fe5	test(identify): cover provenance branches, CLI, sparkle helper Adds 20 tests around the new provenance path: - identify(): local SD/ComfyUI params -> local-pipeline attribution; visible-sparkle gating at the 0.5 threshold (mocked detector: above, below, unavailable, opt-out); metadata verdict not downgraded by a sparkle hit; OpenAI/SynthID caveats + dedup; ProvenanceReport is JSON-serializable (the CLI --json path); and the honest edge where a C2PA manifest without an AI source marker stays 'unknown'. - CLI 'identify': help, clean PNG, AI PNG platform, valid --json, missing file. - gemini_engine.detect_sparkle_confidence: float in range for a real image, None for an unreadable file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 16:27:00 -07:00
test-user	fa104bcade	feat(identify): provenance command (platform + watermark inventory) New 'identify' command and identify.py module: upload an image, get one ProvenanceReport answering where it was made and what watermarks it carries. Aggregates every locally-readable signal: - C2PA Content Credentials -> generating platform (issuer + generator). - IPTC digitalSourceType 'Made with AI' (Meta and others). - Embedded SD/ComfyUI generation parameters (local pipelines). - SynthID metadata proxy (Google / OpenAI C2PA companion). - Visible Gemini sparkle (cv2 fallback for the stripped-metadata case), promoted only at confidence >= 0.5 (corpus-tuned: Gemini sparkles score >= 0.56, non-sparkle <= 0.49). is_ai_generated is True or None, never asserted False -- stripped metadata leaves no local proof of a clean origin, so absence of signals is reported as 'unknown' with an explicit caveat. The SynthID pixel watermark remains locally undecodable; the report says so. Non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) get the same issuer + generator attribution via a binary scan (the caBX parser is PNG-only). The cv2 dependency is isolated in gemini_engine.detect_sparkle_confidence so identify.py stays type-clean. CLI supports --json and --no-visible. Validated against the 109-image corpus: 14/14 positives flagged AI, 93/94 negatives clean (the one 'neg' flagged is a Meta image that genuinely carries the IPTC tag -- correct), zero true errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 16:19:26 -07:00
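The fa104bcade message describes a verdict that is True or None and never False, with the visible-sparkle signal promoted only at confidence >= 0.5. A minimal sketch of that gating rule follows. The function name `is_ai_verdict`, its parameters, and the constant name are hypothetical, chosen for illustration. They are not the actual `identify.py` API.

```python
from typing import Optional

# Threshold quoted in the commit message (sparkle promoted at confidence >= 0.5).
SPARKLE_THRESHOLD = 0.5


def is_ai_verdict(metadata_signals: list[str],
                  sparkle_confidence: Optional[float]) -> Optional[bool]:
    """Return True if any signal indicates AI, otherwise None (unknown).

    False is never returned: stripped metadata leaves no local proof of a
    clean origin, so the absence of signals is reported as unknown.
    """
    if metadata_signals:
        # Any metadata signal (C2PA, IPTC, SD parameters, ...) is sufficient.
        return True
    if sparkle_confidence is not None and sparkle_confidence >= SPARKLE_THRESHOLD:
        # The detector ran and scored at or above the threshold.
        return True
    # Detector unavailable, below threshold, or no signals at all.
    return None
```

Under these assumptions a score of 0.56 with no metadata yields True, while 0.49 yields None rather than False.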

24 Commits
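The c7f0d71f90 message describes a container-agnostic raw-byte scan that finds the XMP TC260:AIGC block and JSON-decodes its HTML-entity-encoded payload. The sketch below shows one way such a scan could work. It assumes the payload sits between literal `<TC260:AIGC>` and `</TC260:AIGC>` tags, which the commit message does not confirm, and `find_aigc_label` is a hypothetical name rather than the repository's `aigc_label()`.

```python
import html
import json
import re
from typing import Optional

# Assumed layout: JSON payload between literal opening and closing tags.
_AIGC_BLOCK = re.compile(rb"<TC260:AIGC>(.*?)</TC260:AIGC>", re.DOTALL)


def find_aigc_label(raw: bytes) -> Optional[dict]:
    """Scan raw file bytes for a TC260 AIGC block and decode its JSON payload."""
    match = _AIGC_BLOCK.search(raw)
    if match is None:
        return None
    # The payload is HTML-entity-encoded (for example &quot; for a double quote).
    text = html.unescape(match.group(1).decode("utf-8", errors="replace"))
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Only a JSON object counts as a label block.
    return payload if isinstance(payload, dict) else None


sample = (b"\x89PNG...<TC260:AIGC>{&quot;Label&quot;:&quot;1&quot;,"
          b"&quot;ContentProducer&quot;:&quot;demo&quot;}</TC260:AIGC>...")
label = find_aigc_label(sample)
```

Because the scan works on raw bytes, it does not depend on the container format, which matches the commit's stated reason for covering generators that share the standard.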