diff --git a/CLAUDE.md b/CLAUDE.md index 88c58f8..5e22039 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -37,7 +37,7 @@ You are a **principal Python engineer** maintaining a CLI tool and library for r - `noai/c2pa.py` — PNG chunk parser; use `extract_c2pa_chunk(path)` to get raw caBX payload, `has_c2pa_metadata(path)` to detect. Do not reimplement chunk parsing. `extract_c2pa_info(path)` sets `synthid_watermark`/`synthid_vendors` when the manifest is signed by a SynthID-using vendor, and `soft_binding`/`soft_binding_vendors` when a `c2pa.soft-binding` `alg` names a forensic-watermark vendor (`soft_binding_vendors_in(buffer)` is the shared byte-scan, used by both the PNG parser and the non-PNG binary path). PNG/caBX chunk reads are clamped to the remaining file size (`safe_length = min(length, remaining)`; skipped chunks use seek) so a malformed huge `length` cannot drive a multi-GB allocation (shared safety discipline matching `isobmff.scan_c2pa_region`). - `noai/constants.py` — PNG_SIGNATURE, C2PA_CHUNK_TYPE, C2PA_SIGNATURES, and `C2PA_AI_VENDORS` — the single `C2paAiVendor` registry of C2PA-signing vendors (issuer byte, resolved org name, the `identify` platform label, and a `synthid` flag), from which `C2PA_ISSUERS`, `SYNTHID_C2PA_ISSUERS` (issuers that pair SynthID with C2PA: Google, OpenAI), and `identify._ISSUER_PLATFORM` are all **derived** — plus `C2PA_SOFT_BINDINGS` (soft-binding `alg` prefix → forensic-watermark vendor: Adobe TrustMark, Digimarc, Imatag, Steg.AI, Microsoft, ...). Add a new C2PA vendor as one `C2PA_AI_VENDORS` entry (never edit the derived dicts), a new soft-binding to `C2PA_SOFT_BINDINGS`; not inline. - `metadata.py` — `scan_head(path, size=1MB)` is the shared input for every C2PA/AIGC/IPTC byte scan: first `size` bytes plus the payloads of any provenance metadata found beyond that window — for ISOBMFF, the late provenance boxes from `isobmff.scan_c2pa_region` (catches a manifest after a large `mdat`); for **PNG**, the late `tEXt`/`iTXt`/`zTXt`/`eXIf`/`iCCP` chunks from `_png_late_metadata` (catches an XMP/EXIF packet appended after a large `IDAT`, e.g. a TC260 AIGC label at ~2.7 MB). Behavior-neutral (`f.read(size)`) for non-ISOBMFF inputs and for any file that fits within `size`. Use it instead of `open().read(1MB)` for any new marker scan. **Memoized per (path, size, mtime)** (added 2026-06-09, `_scan_head_cached` lru_cache, `maxsize=8`): one `identify`/`get_ai_metadata` call fans out to ~8 byte-scan detectors that each re-read the same file head, so the cache turns those into a single read; the mtime key invalidates on change, a stat failure falls back to an uncached read. `synthid_source(path)` returns the vendor name(s) if the C2PA manifest implies a SynthID pixel watermark, else None. Format-agnostic: PNG via the caBX parser, JPEG/WebP/AVIF/HEIF/JXL via a binary scan (C2PA marker + SynthID issuer + AI-source marker). `get_ai_metadata` surfaces the verdict, and `metadata --check` prints it as a callout. Both `get_ai_metadata` and `has_ai_metadata` guard the PIL open with `except Exception` (HEIC/unknown formats raise non-OSError) and fall through to the binary scan. `xai_signature(path)` detects xAI/Grok's EXIF-only scheme (`ImageDescription` = `Signature: ` + UUID `Artist`); it feeds `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify`. `iptc_ai_system(path)` detects the IPTC Photo Metadata 2025.1 AI-disclosure XMP properties (`IPTC_AI_FIELD_MARKERS` = `AISystemUsed`/`AISystemVersionUsed`/`AIPromptInformation`/`AIPromptWriterName`) and returns the `AISystemUsed` generator name (or `"fields present"`). `remove_ai_metadata` routes **ISOBMFF video** (`.mp4`/`.mov`/`.m4v`) through the same `isobmff.strip_c2pa_boxes` as AVIF/HEIF (MP4 is ISOBMFF), and `_scrub_ai_exif` removes the xAI signature + AI-generator EXIF tags on JPEG output. `strip_c2pa_boxes` is **fail-safe** on a malformed box: it returns the original bytes unchanged with a logged warning instead of truncating the tail to EOF (detection-only `scan_c2pa_region` still stops at a malformed box). `_png_late_metadata` clamps each late-chunk read to the remaining file size (`safe_length = min(length, remaining)`) so a malformed `length` cannot drive a multi-GB allocation, AND advances the cursor by `safe_length` (not the raw `length`) so an inflated length cannot jump past EOF and abort the scan, silently skipping a genuine AI-label chunk after it. -- `identify.py` — the OpenAI rollout caveat is keyed on `_vendor_of(synthid) == "OpenAI"` (not a raw substring over the issuer + verdict blob). `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, the China TC260 AIGC label via `metadata.aigc_label`, the HuggingFace `hf-job-id` job marker via `metadata.huggingface_job`, the Samsung Galaxy AI editing marker via `metadata.samsung_genai`, the visible marks — Gemini sparkle plus the ByteDance Doubao 豆包AI生成 / Jimeng 即梦AI / Samsung Galaxy AI "Contenuti generati dall'AI" text marks via the `watermark_registry` — open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). The `hf_job`, visible-mark, and Samsung `samsung_genai` signals are **medium** confidence: each lifts an otherwise-Unknown verdict to a tentative AI (`hf_only` / `visible_only` / `samsung_only`, parallel branches; `visible_only` fires on any `visible_*` signal) but is excluded from the high-confidence `ai_from_metadata` set, so none overrides a hard metadata signal. **Visible-mark detection** (`check_visible`, signals `visible_sparkle` / `visible_doubao` / `visible_jimeng` / `visible_samsung`): the Gemini sparkle keeps its own file-level path (`_visible_sparkle` → `gemini_engine.detect_sparkle_confidence`, promoted only at confidence ≥ `_SPARKLE_THRESHOLD` 0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49), while Doubao/Jimeng/Samsung reuse the registry detectors (`_visible_text_marks` → `watermark_registry`, iterating `_VISIBLE_MARK_PLATFORM`), each gated by its own engine NCC threshold via `MarkDetection.detected` (Doubao 0.4, Jimeng 0.45, Samsung 0.4). Doubao/Jimeng are normally also caught by the TC260 AIGC metadata label and Samsung by its C2PA + `genAIType` marker, so the visible path is their stripped-metadata fallback. Visible marks set `platform` only when no harder signal already did, and (like the sparkle) are excluded from integrity-clash vendor claims. The cv2 dependency lives in the engines, not here. **`import identify` is deliberately light** (~21 MB; ~36 MB with cv2 loaded by a visible-mark run, ~106 MB for a full `check_visible` run): it imports only the pure `noai.c2pa`/`noai.constants` submodules, and `noai/__init__` is lazy (see "Test and lint"), so torch/diffusers are NOT pulled at import even in a full `gpu`/`detect` install — fits a 512 MB host. The heavy paths are opt-in: `check_invisible=True` needs the `detect`/`trustmark` extras (each pulls **torch**; TrustMark also **downloads weights**), so on a core-only deploy leave `check_invisible` off (it is a no-op there anyway). Before the lazy `__init__`, the mere presence of torch in the env inflated `import identify` to ~420 MB. **C2PA platform attribution is device-token-first, issuer-scan fallback** (`_device_platform` scans manifest bytes for `_DEVICE_C2PA_PLATFORM` tokens, then `_attribute_platform`/`_ISSUER_PLATFORM`). **Why, verified on real signed files 2026-05-26:** the old issuer-only byte-scan matched ANY issuer substring anywhere, so multi-entity manifests mis-attributed -- Leica→"Truepic" (a signing authority in the trust chain), Nikon→"Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), Pixel→"Google (Gemini)" ("Google LLC" cert org), Truepic→"Google". A distinctive device token wins instead. **Token distinctiveness is load-bearing:** bare `b"Truepic"` mis-fires (it appears in unrelated trust chains -- it mis-attributed the OpenAI `chatgpt-1.png` fixture), so the token is the specific `b"Truepic_Lens"` from the Lens SDK claim generator; likewise `b"Pixel Camera"` (cert CN) not bare `b"Pixel"`. `_DEVICE_C2PA_PLATFORM` lists ONLY tokens **verified against a real C2PA file**: Leica (`lc_c2pa`/`Leica Camera`), Nikon (`NIKON`), Pixel (`Pixel Camera` -- from a real Pixel 10 Pro file attached to c2pa-rs issue #1609/#1554), Sony (`sony.sig`/`sony.cert` -- Sony's own C2PA assertion namespace, verified on a real Sony PXW-Z300 file; NOT bare "Sony" which is a common EXIF Make), Truepic (`Truepic_Lens`). Canon/Bria have **no public direct-download C2PA sample** (checked exhaustively: GitHub issue/PR attachments, contentcredentials gallery, HF datasets -- all upload-to-verify or token-gated; Canon's only public file was a self-signed hobbyist CR3, not factory), so they stay unmapped until a real file is captured (same fixture discipline as Grok/Doubao). The Sony sample is video (MP4) -- our ISOBMFF C2PA path detects it; Sony Alpha stills likely share the `sony.*` namespace but are not separately verified. **Samsung Galaxy + ASUS Gallery live in a separate `_SIGNER_C2PA_PLATFORM` (scanned after `_device_platform`, before the issuer fallback), NOT in `_DEVICE_C2PA_PLATFORM`** — verified on real signed files 2026-05-29. Reason: a Galaxy phone stamps BOTH its device cert AND a `trainedAlgorithmicMedia`/genAIType AI marker on a Generative-Edit image, so treating it as a "genuine camera capture" would false-fire integrity-clash rule 2 on every Galaxy AI edit. The signer tokens (`b"Samsung Galaxy"` cert org — distinct from the EXIF `SM-xxxx` model string on ordinary Samsung photos; `b"com.asus.gallery"` claim generator) only resolve the platform label; the AI verdict still comes from the source-type / genAIType. ASUS Gallery is a C2PA-signed edit with no AI marker, so it attributes the platform without asserting `is_ai`. **Samsung's `genAIType` (in the proprietary `PhotoEditor_Re_Edit_Data` JSON) is an undocumented Galaxy-AI editing marker** (`metadata.samsung_genai`, gated on the `PhotoEditor_Re_Edit_Data` container; non-zero value = AI tool used, values {1,5} observed): medium-confidence because the field has no public spec (verified 2026-05-29: absent from C2PA spec + Samsung docs), but it co-occurred with `trainedAlgorithmicMedia` in 3/3 verified files that record a source-type and was the SOLE AI marker on a Galaxy S24 file that omits the source type. Camera C2PA marks capture authenticity, not AI (Pixel carries `computationalCapture`, not `trainedAlgorithmicMedia`), so these never set `is_ai` -- that stays driven by digital-source-type. `c2pa.cbor_text_after` (now public) is best-effort for the `generator` detail string only and can be None when the manifest keys it `claim_generator_info` (Pixel). **Issuer→generator mapping is `is_ai`-gated** (`_attribute_platform(issuers, is_ai=c2pa_is_ai)`): a specific AI-generator platform is named only when the digital-source-type is `trainedAlgorithmicMedia`; on a non-AI source an issuer substring is treated as incidental (an "Adobe XMP" toolkit string in an *unmapped* Canon/Sony capture would otherwise mislabel it "Adobe Firefly"), so it degrades to the neutral "C2PA signer: X" label. Real Firefly/OpenAI/Google output carries the AI source-type, so it is unaffected (verified: chatgpt-1.png→OpenAI, firefly-1.png→Adobe Firefly still attribute). `_attribute_platform` defaults `is_ai=True` so the mapping stays unit-testable in isolation. Add capture-camera tokens to `_DEVICE_C2PA_PLATFORM`, editing-app/AI-device signer tokens to `_SIGNER_C2PA_PLATFORM`, generator/issuer platforms to the `C2PA_AI_VENDORS` registry in `constants.py` (which derives `_ISSUER_PLATFORM`), not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged. **Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read. **Integrity-clash detection** (`_integrity_clashes`, surfaced as `ProvenanceReport.integrity_clashes`, printed in red by `identify` and serialized to `--json`): contradictions between independent generator stamps are a laundering/spoofing tell. Two rules: (1) two or more distinct AI-origin vendors named by **independent** signals (e.g. C2PA OpenAI + EXIF `Make="Ideogram AI"`), and (2) a camera-capture C2PA device (`_DEVICE_C2PA_PLATFORM`) coexisting with any AI-generation marker. **Independence is source-grouped (`_CLASH_SOURCE`, added 2026-06-02):** the C2PA issuer attribution (`c2pa`) and the SynthID proxy (`synthid`) are NOT independent — the proxy is inferred from the *same* manifest — so they share one source and two vendors named within a single manifest do not clash. This killed a false-positive class found on the spaces corpus: legitimate multi-actor manifests where a product wraps another vendor's engine (Microsoft Designer on OpenAI → `OpenAI, Microsoft`; Microsoft on Google → `Microsoft, Google LLC, Google C2PA Core Generator Library`) or an edit chain re-signs (Adobe over a Gemini original → Adobe c2pa + Google synthid) — 19 such files across the 2026-06-01/02 batches read as clashes before the fix. Rule 1 still fires when a manifest vendor disagrees with a genuinely independent stamp (EXIF/XMP generator, IPTC `AISystemUsed`, AIGC, xAI); each non-`c2pa`/`synthid` family is its own source (`test_identify.py::TestIntegrityClashes::{test_multi_actor_manifest_no_clash,test_manifest_vendor_vs_independent_signal_clashes}`). Vendor normalization is `_vendor_of` over `_AI_VENDOR_TOKENS` (so a C2PA "Google (Gemini)" issuer and a SynthID-Google proxy agree, while different vendors clash). **High-precision by design:** only hard generator stamps feed it (C2PA-issuer when source is AI, SynthID, EXIF/XMP generator, IPTC `AISystemUsed`, xAI, AIGC); the fuzzy visible sparkle and the open invisible watermark are **excluded** (the latter can be a by-product of our own SDXL removal pass). The c2pa vendor is classified from the issuer attribution / generator, NOT the resolved `platform` (a camera label like "Google Pixel" would mis-normalize to "Google"). All real single-origin fixtures (chatgpt/firefly/doubao/grok/mj) verified to produce **zero** clashes (false-positive guard in `test_identify.py::TestRealSamplesHaveNoClash`). +- `identify.py` — the OpenAI rollout caveat is keyed on `_vendor_of(synthid) == "OpenAI"` (not a raw substring over the issuer + verdict blob). `identify(path)` aggregates every locally-readable signal (C2PA issuer→platform, C2PA soft-binding forensic-watermark vendor, **C2PA cloud-manifest reference** via `metadata.c2pa_cloud_manifest` — signal `c2pa_cloud`, **medium**, provenance-only (does NOT set `is_ai`, excluded from `ai_from_metadata` + clash vendors): a C2PA 2.4 Durable-Content-Credentials case where the embedded manifest is stripped but an XMP `dcterms:provenance` pointer to the vendor's cloud manifest store (`_C2PA_MANIFEST_REPOSITORIES`, today `cai-manifests.adobe.com` → "Adobe Content Authenticity") survives, so the credentials stay recoverable server-side; only emitted when no embedded manifest already attributed the file — surfaced on 2 corpus PNGs 2026-06-10 that read fully `unknown` before, IPTC "Made with AI" + IPTC 2025.1 `AISystemUsed`, embedded SD/ComfyUI params, SynthID proxy, xAI/Grok EXIF signature via `metadata.xai_signature`, the China TC260 AIGC label via `metadata.aigc_label`, the HuggingFace `hf-job-id` job marker via `metadata.huggingface_job`, the Samsung Galaxy AI editing marker via `metadata.samsung_genai`, the visible marks — Gemini sparkle plus the ByteDance Doubao 豆包AI生成 / Jimeng 即梦AI / Samsung Galaxy AI "Contenuti generati dall'AI" text marks via the `watermark_registry` — open invisible watermark, Adobe TrustMark via `trustmark_detector`) into one `ProvenanceReport`. `is_ai_generated` is True or None (never asserted False — stripped metadata is not proof of clean origin). The `hf_job`, visible-mark, and Samsung `samsung_genai` signals are **medium** confidence: each lifts an otherwise-Unknown verdict to a tentative AI (`hf_only` / `visible_only` / `samsung_only`, parallel branches; `visible_only` fires on any `visible_*` signal) but is excluded from the high-confidence `ai_from_metadata` set, so none overrides a hard metadata signal. **Visible-mark detection** (`check_visible`, signals `visible_sparkle` / `visible_doubao` / `visible_jimeng` / `visible_samsung`): the Gemini sparkle keeps its own file-level path (`_visible_sparkle` → `gemini_engine.detect_sparkle_confidence`, promoted only at confidence ≥ `_SPARKLE_THRESHOLD` 0.5; corpus-tuned to separate Gemini sparkles ≥0.56 from non-sparkle ≤0.49), while Doubao/Jimeng/Samsung reuse the registry detectors (`_visible_text_marks` → `watermark_registry`, iterating `_VISIBLE_MARK_PLATFORM`), each gated by its own engine NCC threshold via `MarkDetection.detected` (Doubao 0.4, Jimeng 0.45, Samsung 0.4). Doubao/Jimeng are normally also caught by the TC260 AIGC metadata label and Samsung by its C2PA + `genAIType` marker, so the visible path is their stripped-metadata fallback. Visible marks set `platform` only when no harder signal already did, and (like the sparkle) are excluded from integrity-clash vendor claims. The cv2 dependency lives in the engines, not here. **`import identify` is deliberately light** (~21 MB; ~36 MB with cv2 loaded by a visible-mark run, ~106 MB for a full `check_visible` run): it imports only the pure `noai.c2pa`/`noai.constants` submodules, and `noai/__init__` is lazy (see "Test and lint"), so torch/diffusers are NOT pulled at import even in a full `gpu`/`detect` install — fits a 512 MB host. The heavy paths are opt-in: `check_invisible=True` needs the `detect`/`trustmark` extras (each pulls **torch**; TrustMark also **downloads weights**), so on a core-only deploy leave `check_invisible` off (it is a no-op there anyway). Before the lazy `__init__`, the mere presence of torch in the env inflated `import identify` to ~420 MB. **C2PA platform attribution is device-token-first, issuer-scan fallback** (`_device_platform` scans manifest bytes for `_DEVICE_C2PA_PLATFORM` tokens, then `_attribute_platform`/`_ISSUER_PLATFORM`). **Why, verified on real signed files 2026-05-26:** the old issuer-only byte-scan matched ANY issuer substring anywhere, so multi-entity manifests mis-attributed -- Leica→"Truepic" (a signing authority in the trust chain), Nikon→"Adobe Firefly" (XMP-toolkit "Adobe" + the sample's "Adobe_MAX" name), Pixel→"Google (Gemini)" ("Google LLC" cert org), Truepic→"Google". A distinctive device token wins instead. **Token distinctiveness is load-bearing:** bare `b"Truepic"` mis-fires (it appears in unrelated trust chains -- it mis-attributed the OpenAI `chatgpt-1.png` fixture), so the token is the specific `b"Truepic_Lens"` from the Lens SDK claim generator; likewise `b"Pixel Camera"` (cert CN) not bare `b"Pixel"`. `_DEVICE_C2PA_PLATFORM` lists ONLY tokens **verified against a real C2PA file**: Leica (`lc_c2pa`/`Leica Camera`), Nikon (`NIKON`), Pixel (`Pixel Camera` -- from a real Pixel 10 Pro file attached to c2pa-rs issue #1609/#1554), Sony (`sony.sig`/`sony.cert` -- Sony's own C2PA assertion namespace, verified on a real Sony PXW-Z300 file; NOT bare "Sony" which is a common EXIF Make), Truepic (`Truepic_Lens`). Canon/Bria have **no public direct-download C2PA sample** (checked exhaustively: GitHub issue/PR attachments, contentcredentials gallery, HF datasets -- all upload-to-verify or token-gated; Canon's only public file was a self-signed hobbyist CR3, not factory), so they stay unmapped until a real file is captured (same fixture discipline as Grok/Doubao). The Sony sample is video (MP4) -- our ISOBMFF C2PA path detects it; Sony Alpha stills likely share the `sony.*` namespace but are not separately verified. **Samsung Galaxy + ASUS Gallery live in a separate `_SIGNER_C2PA_PLATFORM` (scanned after `_device_platform`, before the issuer fallback), NOT in `_DEVICE_C2PA_PLATFORM`** — verified on real signed files 2026-05-29. Reason: a Galaxy phone stamps BOTH its device cert AND a `trainedAlgorithmicMedia`/genAIType AI marker on a Generative-Edit image, so treating it as a "genuine camera capture" would false-fire integrity-clash rule 2 on every Galaxy AI edit. The signer tokens (`b"Samsung Galaxy"` cert org — distinct from the EXIF `SM-xxxx` model string on ordinary Samsung photos; `b"com.asus.gallery"` claim generator) only resolve the platform label; the AI verdict still comes from the source-type / genAIType. ASUS Gallery is a C2PA-signed edit with no AI marker, so it attributes the platform without asserting `is_ai`. **Samsung's `genAIType` (in the proprietary `PhotoEditor_Re_Edit_Data` JSON) is an undocumented Galaxy-AI editing marker** (`metadata.samsung_genai`, gated on the `PhotoEditor_Re_Edit_Data` container; non-zero value = AI tool used, values {1,5} observed): medium-confidence because the field has no public spec (verified 2026-05-29: absent from C2PA spec + Samsung docs), but it co-occurred with `trainedAlgorithmicMedia` in 3/3 verified files that record a source-type and was the SOLE AI marker on a Galaxy S24 file that omits the source type. Camera C2PA marks capture authenticity, not AI (Pixel carries `computationalCapture`, not `trainedAlgorithmicMedia`), so these never set `is_ai` -- that stays driven by digital-source-type. `c2pa.cbor_text_after` (now public) is best-effort for the `generator` detail string only and can be None when the manifest keys it `claim_generator_info` (Pixel). **Issuer→generator mapping is `is_ai`-gated** (`_attribute_platform(issuers, is_ai=c2pa_is_ai)`): a specific AI-generator platform is named only when the digital-source-type is `trainedAlgorithmicMedia`; on a non-AI source an issuer substring is treated as incidental (an "Adobe XMP" toolkit string in an *unmapped* Canon/Sony capture would otherwise mislabel it "Adobe Firefly"), so it degrades to the neutral "C2PA signer: X" label. Real Firefly/OpenAI/Google output carries the AI source-type, so it is unaffected (verified: chatgpt-1.png→OpenAI, firefly-1.png→Adobe Firefly still attribute). `_attribute_platform` defaults `is_ai=True` so the mapping stays unit-testable in isolation. Add capture-camera tokens to `_DEVICE_C2PA_PLATFORM`, editing-app/AI-device signer tokens to `_SIGNER_C2PA_PLATFORM`, generator/issuer platforms to the `C2PA_AI_VENDORS` registry in `constants.py` (which derives `_ISSUER_PLATFORM`), not inline. For non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) the caBX parser returns nothing, so issuer (`_issuers_in`) and generator (`_ai_tools_in`, reusing `C2PA_AI_TOOLS`) are recovered by binary-scanning the first MB. EXIF `Software` / `Make` / `Artist` / `ImageDescription` and XMP `CreatorTool` generator tags are read by `metadata.exif_generator` (PIL+piexif for any format PIL opens incl. AVIF, plus a container-agnostic XMP raw-byte scan that also covers HEIF/JXL), matched against `AI_GENERATOR_TOKENS` so ordinary editors (plain "Adobe Photoshop") and real-camera `Make` ("Apple"/"Canon") are not flagged. **Ideogram tags its output with EXIF `Make="Ideogram AI"`** (verified on a real download 2026-05-24) — that's why `Make` is read. **Integrity-clash detection** (`_integrity_clashes`, surfaced as `ProvenanceReport.integrity_clashes`, printed in red by `identify` and serialized to `--json`): contradictions between independent generator stamps are a laundering/spoofing tell. Two rules: (1) two or more distinct AI-origin vendors named by **independent** signals (e.g. C2PA OpenAI + EXIF `Make="Ideogram AI"`), and (2) a camera-capture C2PA device (`_DEVICE_C2PA_PLATFORM`) coexisting with an AI-generation marker **from a source INDEPENDENT of the camera's own manifest**. **Rule 2's independence gate (added 2026-06-11):** a device that both captures and runs on-device generative AI (Google Pixel Magic Editor / Pixel Studio) records the capture AND the AI edit in ONE C2PA manifest — so the AI vendor is named only from that same manifest (`c2pa` issuer + `synthid` proxy, both `c2pa_manifest` source) — a legitimate edit chain, NOT a clash. Rule 2 therefore fires only when some `ai_vendor_claims` family has a source `!= "c2pa_manifest"` (EXIF/XMP generator, IPTC, TC260 AIGC, a second manifest naming AI on a camera capture — the real laundering tell). This killed a false-positive class on the corpus: 2 real Pixel generative-edit PNGs (`computationalCapture` + `trainedAlgorithmicMedia` + "Applied imperceptible SynthID watermark" in one Google manifest) read as camera-vs-AI clashes before the gate. Pure cameras (Leica/Sony/Nikon/Truepic) that do NOT generate AI still clash on any within-manifest AI marker only if it is independent — they never legitimately carry one, so the gate is behavior-neutral for them while fixing Pixel (regression-guarded by `test_identify.py::TestIntegrityClashesHelper::{test_pixel_generative_edit_same_manifest_no_clash,test_camera_plus_independent_ai_marker_still_clashes}` + `TestIntegrityClashEndToEnd::test_pixel_generative_edit_no_clash`). **Independence is source-grouped (`_CLASH_SOURCE`, added 2026-06-02):** the C2PA issuer attribution (`c2pa`) and the SynthID proxy (`synthid`) are NOT independent — the proxy is inferred from the *same* manifest — so they share one source and two vendors named within a single manifest do not clash. This killed a false-positive class found on the spaces corpus: legitimate multi-actor manifests where a product wraps another vendor's engine (Microsoft Designer on OpenAI → `OpenAI, Microsoft`; Microsoft on Google → `Microsoft, Google LLC, Google C2PA Core Generator Library`) or an edit chain re-signs (Adobe over a Gemini original → Adobe c2pa + Google synthid) — 19 such files across the 2026-06-01/02 batches read as clashes before the fix. Rule 1 still fires when a manifest vendor disagrees with a genuinely independent stamp (EXIF/XMP generator, IPTC `AISystemUsed`, AIGC, xAI); each non-`c2pa`/`synthid` family is its own source (`test_identify.py::TestIntegrityClashes::{test_multi_actor_manifest_no_clash,test_manifest_vendor_vs_independent_signal_clashes}`). Vendor normalization is `_vendor_of` over `_AI_VENDOR_TOKENS` (so a C2PA "Google (Gemini)" issuer and a SynthID-Google proxy agree, while different vendors clash). **High-precision by design:** only hard generator stamps feed it (C2PA-issuer when source is AI, SynthID, EXIF/XMP generator, IPTC `AISystemUsed`, xAI, AIGC); the fuzzy visible sparkle and the open invisible watermark are **excluded** (the latter can be a by-product of our own SDXL removal pass). The c2pa vendor is classified from the issuer attribution / generator, NOT the resolved `platform` (a camera label like "Google Pixel" would mis-normalize to "Google"). All real single-origin fixtures (chatgpt/firefly/doubao/grok/mj) verified to produce **zero** clashes (false-positive guard in `test_identify.py::TestRealSamplesHaveNoClash`). - `watermark_registry.py` — **single catalog of known visible watermarks**, the unified "find known marks in their usual places, recognize, remove" entry. **Reverse-alpha based by policy**: a mark is listed only once a real alpha map has been captured for it, and removal inverts that map (`original = (wm - a*logo)/(1-a)`) — Gemini recovers cleanly with no inpaint (its sparkle alpha comes from a pure-black capture, so it is near-exact), while **Doubao, Jimeng, and Samsung all add an always-on THIN residual inpaint** over the glyph footprint (their text marks re-rasterize + jitter a few px per image, so a single capture cannot pixel-cancel them; the inpaint blends into the reverse-alpha-recovered pixels). Arbitrary-region inpainting still lives in `region_eraser`/`erase`. Each `KnownMark` ties a key to {usual `location`, `in_auto` flag, `recovery` (="reverse-alpha"), a `detect` adapter → uniform `MarkDetection`, a `remove` adapter}. Entries today: `gemini` (bottom-right sparkle), `doubao` (bottom-right "豆包AI生成"), `jimeng` (bottom-right "★ 即梦AI"), and `samsung` (bottom-**LEFT** "✦ Contenuti generati dall'AI", Samsung Galaxy AI, Italian locale). `detect_marks` scans all; `best_auto_mark` picks the highest-confidence detection. **Cross-engine confidences aren't directly comparable**, so the gemini adapter applies the corpus-validated 0.5 sparkle threshold (`_GEMINI_AUTO_MIN_CONF`) for its `detected` flag — otherwise the gemini engine's loose internal threshold weakly fires (~0.36) on the Doubao text and hijacks `auto`. The shape-keyed Doubao/Jimeng/Samsung NCC detectors don't cross-fire (jimeng scores ~0.22 on the Doubao strip, well under its 0.45 threshold; Samsung is bottom-left so it shares no corner with the others, and scored 0.0 on Doubao/Jimeng captures and they 0.0 on a real Samsung photo), so `auto` picks the right one. `cli.cmd_visible` is registry-driven: `--mark auto` → `best_auto_mark`, `--mark ` → that mark; `--mark` choices come from `mark_keys()`. **`cli._remove_visible_auto` is the shared visible-removal helper used by `cmd_all`/`cmd_batch` too** (they no longer hardcode `GeminiEngine`), so `all`/`batch` remove Doubao/Jimeng/Samsung text marks, not just the Gemini sparkle (regression-guarded by `test_all_visible_step_uses_registry`). The three text-mark adapters were consolidated 2026-06-09: a single `_text_mark(key, label, location)` builds the registry row from one parameterized `_text_mark_detect`/`_text_mark_remove` pair (reverse-alpha only when detected/forced AND `reverse_alpha_available`, else skipped — no inpaint); the gemini adapters stay bespoke. Add a new visible mark = one `_text_mark(...)` row + its `TextMarkConfig` (with a captured alpha map); do not re-add per-mark `if` branches or copy-paste adapters. **Alpha-on-save policy (issue #30):** `cli._write_bgr_with_alpha` rejoins the input's alpha plane **unchanged** — it must NOT zero alpha in the watermark bbox. Reverse-alpha (and `erase` inpaint) recover real pixels there, so zeroing alpha punched a transparent hole that renders as a solid **white box** on any non-transparent viewer (Gemini app exports are opaque RGBA, so every user hit it; regression-guarded by `test_visible_keeps_alpha_opaque_in_watermark_region`). The registry `remove()` still returns its region (used for `inpaint_residual` positioning), but the CLI no longer uses it to clear alpha. - `gemini_engine.py` — visible Gemini-sparkle remover/detector (cv2/numpy, no GPU). `detect_sparkle_confidence(path)` is the file-level entry point used by `identify.py`. The public entry points normalize a grayscale (2D) or RGBA (4-channel) input to BGR up front so a non-BGR image does not crash the cv2 pipeline. **Detection localization (issue #36):** `detect_watermark`'s global multi-scale NCC search applies a size weight (`(scale/96)**0.5`) that suppresses tiny-patch false positives but can let a larger, mediocre match (e.g. a bright collar in a portrait) outrank a small, near-perfect sparkle in the corner — so a faint sparkle on a busy background scored below threshold and read as clean (the regression osachub reported from widening the search window 256px->512px between v0.7.2 and v0.8.8). `_corner_promote` adds a bottom-right-corner raw-NCC pass on top of the global search: a match with raw NCC >= `_CORNER_PROMOTE_NCC` 0.85 that beats the global pick overrides it (it only ever replaces a lower-fidelity pick, so it cannot weaken an existing detection), rescuing the buried sparkle without reverting the wider window. The corner side is **relative-clamped** (`_CORNER_PROMOTE_FRAC` 0.20 of the short side, clamped to `[_CORNER_PROMOTE_MIN` 96, `_CORNER_PROMOTE_MAX` 384`]`): a fixed 256px is a true corner on a large image but covers ~70% of a small portrait, where a real photo raw-matches the star at ~0.81 (relative tightening drops that worst case to ~0.69, while the upper clamp stops the corner ballooning on huge images where a real photo reached ~0.83 at 512px). The 0.85 gate sits midway between the worst real-photo corner match (~0.78 across native + downscaled negatives) and a genuine faint sparkle (~0.93), so promotion adds true detections with zero corpus false positives (Gemini's sparkle sits ~60-160px from the corner at fixed margins, covered by the [96, 384] band at every measured size). Regression-guarded by `test_gemini_engine.py::TestCornerPromotion`. **Removal is reverse-alpha with an over-subtraction guard** (`remove_watermark` → `_reverse_alpha_blend`, else `_inpaint_footprint`): the sparkle alpha is computed (`alpha = max(R,G,B)/255`) from the bundled sparkle-on-black captures `assets/gemini_bg_{96,48}.png` (the capture max is ~130, NOT 255 — the sparkle is a ~51%-opaque white overlay, so `alpha` maxes at ~0.51, which is CORRECT for the capture, not under-exposed). The alpha is near-exact only when the real mark's effective opacity matches the capture, which holds on bright/flat backgrounds — re-verified clean on `demo_banana_before.png` 2026-05-31. **Issue #30 (dark-background black pit):** on a dark/textured background (e.g. grass, ~73) the real sparkle's effective opacity is LOWER than the captured 0.51, so the fixed-alpha reverse blend OVER-subtracts (`watermarked - a*logo` goes negative) and drives the footprint to black — the white sparkle becomes a black diamond. `remove_watermark` now detects this via `_reverse_alpha_oversubtracts` (fraction of footprint pixels with `alpha >= _FOOTPRINT_ALPHA` 0.1 whose numerator < 0 exceeds `_OVERSUB_FOOTPRINT_FRAC` 0.05) and **inpaints the footprint** (`_inpaint_footprint`, cv2 NS over the dilated alpha mask) from the surrounding pixels instead. **Behavior-neutral on the working case:** a bright background over-subtracts at ~0% so reverse-alpha is used and the output is byte-identical to before (verified: demo_banana 0.0 frac vs issue-#30 grass 0.61 frac; regression-guarded by `test_gemini_engine.py::TestOverSubtractionGuard`, which composites the sparkle at a reduced effective alpha to reproduce the mismatch). **Under-subtraction (the symmetric case, fixed 2026-06-03):** some real Gemini sparkles are rendered MORE opaque than the captured ~0.51, so the fixed-alpha reverse blend UNDER-subtracts and leaves a bright sparkle residual the detector still fires on (measured on the spaces corpus: a visible-removal audit through the registry path left a detectable sparkle on a meaningful fraction of marks, all under-removals, NOT a background-brightness class — failures and successes had the same input confidence and the same background-luma distribution; the discriminator was the removal delta itself). `remove_watermark` now estimates a per-image alpha gain (`_estimate_alpha_gain`: effective sparkle opacity at the bright core vs the local background ring, `a_eff/a_cap`, clamped `[1.0, _ALPHA_GAIN_MAX` 1.94`]`) and scales the alpha to match before the over-sub/blend branch. The gain cleanly separates on the corpus (under-removed marks ~1.47, cleanly-removed ~1.00), and a deadband (`_ALPHA_GAIN_DEADBAND` 1.05) keeps a matching sparkle **byte-identical** to the pre-fix output, so the fix is purely additive (0 regressions on the audit set; the over-sub guard still runs on the scaled alpha as the safety net for an over-shooting estimate). Regression-guarded by `test_gemini_engine.py::TestUnderSubtractionGain` (composites a more-opaque-than-capture sparkle; **asserts on footprint pixels, NOT the detector** — the detector's NCC is degenerate on a flat synthetic background, so a re-detect conf is meaningless there; the real corpus removal drops the detector from ~0.80 to ~0.27). **False-positive gate (added 2026-06-03):** `detect_watermark`'s shape-only NCC (`spatial*0.5 + gradient*0.3 + var*0.2`) fires on ornate/flat content (text strips, banners, hatching) that coincidentally matches the diamond shape — a real Gemini sparkle is a bright WHITE overlay, so its core sits above the local background, but the NCC is contrast-invariant and cannot see that. The fusion now **demotes** (caps confidence to 0.30) any match that is BOTH low-confidence (`< _SPARKLE_FP_CONF` 0.65) AND has a low core-ring brightness margin (`_core_ring_margin < _SPARKLE_FP_MARGIN` 5). Real sparkles escape via EITHER high confidence (white-bg sparkles score ≥0.79 despite a low margin — the NCC shape match is strong) OR high margin (dark/mid backgrounds, incl. the #36 faint-corner case, lift well clear), so BOTH must fail to demote. The gate is **monotonic** (only ever removes detections, never adds), so it cannot regress the verified-negative corpus (already 0 FPs). On the spaces corpus it demoted 16/495 flagged sparkles (13 carried no AI metadata = content FPs; the 3 AI-meta were visually FPs / a near-invisible white-on-white sparkle whose AI verdict is held by metadata anyway), and dropped the removal-audit failures 20→15 (post-removal flat footprints the NCC re-fired on). `_core_ring_margin` and `_estimate_alpha_gain` share the `_core_and_bg` helper (core 75th-pct brightness vs background-ring median). Regression-guarded by `test_gemini_engine.py::TestSparkleFalsePositiveGate`. **Self-verify repair (added 2026-06-04):** the gain estimate corrects most under-subtractions, but a tail of strong sparkles still survived reverse-alpha (position jitter, or a gain the `[1.0, 1.94]` clamp could not fully reach). After the reverse blend, `remove_watermark` re-detects via `_verify_and_repair`; when a sparkle at or above `_VERIFY_FALLBACK_CONF` 0.5 (the registry's real fail line) remains, it inpaints the footprint and **keeps that only when it lowers the re-detect confidence** — purely additive (the common clean removal re-detects below 0.5 and is returned untouched, so it can never regress). On the spaces corpus this rescued **4 of the 15 remaining gemini removal-audit failures** (15→11, doubao/jimeng still 0), verified through the registry/CLI path. Costs one extra `detect_watermark` per removal (two when the fallback fires). Regression-guarded by `test_gemini_engine.py::TestVerifyAndRepair` (stubs `detect_watermark` to drive the keep-best control flow, since the NCC is degenerate on flat synthetics). **An offset+scale alignment search was prototyped on the remaining 11 fails and REJECTED (2026-06-04):** an audit "ceiling" test suggested it could rescue 4 more (e.g. a5a9 0.577→0.417), but direct inspection showed those were NCC-gaming, not removal — the lower-scoring placement left the sparkle as bright or BRIGHTER (a5a9: first-pass slot 99.5th-pct ~76 at background level, the "aligned win" slot ~164), it just reshaped the residual so the contrast-invariant shape-NCC scored lower. A slot-brightness sanity gate rejected every one, so alignment contributed 0 genuine rescues and was removed (the footprint inpaint stays because it physically reconstructs the slot from its darker surroundings, so its rescues are real). **Lesson: the visible-audit pass/fail metric (re-detect conf < 0.5) is gameable by reshaping the residual — optimizing it directly finds NCC-gaming placements, not clean removals; gate any removal candidate on a physical brightness check, not the detector alone.** The 11 survivors are near-white ill-conditioning (reverse-alpha divides by `1-a`≈0.02) or detector false positives (before≈after≈0.51) that no reverse-alpha placement fixes. The registry's optional `inpaint_residual` (edge cleanup) is a no-op on a clean reverse-alpha removal (and on the same corpus it lowered the re-detect conf on 3 marks, raised it on 10, no-op on 466 — net-neutral on pass/fail, so the self-verify repair, not it, drives the removal tail); an earlier "Gemini smears" read was a misjudged soft-fur original, not an artifact. **The bg assets are now rebuilt from OUR OWN controlled captures** (`data/gemini_capture/captures/`, committed) by `scripts/visible_alpha_solve.py gemini`, which locates the 96px sparkle on the black capture and crops it to the two logo sizes; our capture matched the previously third-party-sourced `gemini_bg_96.png` to **NCC 0.9998**, validating the asset and making it reproducible. Gemini's multi-size fixed-slot model is genuinely different from the Doubao/Jimeng text-strip engines (so it stays a separate engine, not part of the shared-base refactor). - `_text_mark_engine.py` — **shared base for the three reverse-alpha text-mark engines (Doubao/Jimeng/Samsung), extracted 2026-06-09** (they were ~90% byte-identical clones). `TextMarkEngine(config: TextMarkConfig)` owns the whole `locate → extract_mask → detect → _fixed/_aligned_alpha_map → _apply_reverse_alpha → remove_watermark_reverse_alpha` pipeline (+ the asset-keyed `load_alpha_template`/`glyph_silhouette`/`template_match_score` caches). Each engine module is now a thin subclass: it supplies only its `TextMarkConfig` (the tuned constants, the bundled asset, and the bounded structural deltas — `corner` br/bl, `margin_floor` 4/2, `morph_open_size` 5/3, `min_gw` 8/16) plus the test-facing module shims (`_alpha_template`/`_glyph_silhouette`/`_template_match_score` + the constants). Behavior is byte-exact vs the old per-engine code (the three engine test suites pass unchanged). Gemini stays a SEPARATE engine (its multi-size fixed-slot sparkle model is genuinely different). Add a new text mark = a new `TextMarkConfig` + a thin subclass + one registry `_text_mark(...)` row. The engine bullets below describe each mark's calibration history; the LOGIC lives here. @@ -71,7 +71,7 @@ Who embeds what, and whether it is locally detectable (so we know which gaps are - **C2PA / IPTC (covered by the issuer/marker scan):** OpenAI, Google, Adobe Firefly, Microsoft (Designer + **Bing Image Creator** — collected 2026-05-24; Bing now runs Microsoft's own **MAI-Image** model, signs C2PA as "Microsoft", NOT OpenAI/DALL-E), and **Stability AI** (collected from Brand Studio / DreamStudio successor; signs C2PA as "Stability AI Ltd", no SynthID, no imwatermark on its current Stable Image model — issuer added to `C2PA_ISSUERS`). Still unsampled: Canva (its downloads are re-encoded design *exports* that strip C2PA, so a Canva "positive" is inconclusive — skipped), Getty, Shutterstock. Midjourney embeds NO C2PA and no invisible watermark (our `mj-*` sample carried only the IPTC tag). **Samsung Galaxy AI** (Generative Edit / Sketch to Image / Portrait Studio on Galaxy S23 FE / S24 / S25, One UI 7+) signs C2PA as "Samsung Galaxy" with the standard `trainedAlgorithmicMedia` source type AND a proprietary `genAIType` marker; verified on real signed files 2026-05-29 (the standard scan catches the source type; `genAIType` additionally catches a Galaxy S24 file that omits it). It ALSO burns a **visible** localized wordmark into the pixels — a sparkle + "generated with AI" string in the bottom-LEFT corner (issue #37; the Italian "✦ Contenuti generati dall'AI" variant is calibrated) — removed by `samsung_engine.py` / `visible --mark samsung` (reverse-alpha, see the engine bullet); detection feeds `identify` as the medium `visible_samsung` signal. The string is locale-specific, so each locale needs its own captured alpha template. **ASUS Gallery** also signs edited photos as C2PA (`com.asus.gallery`) but with no AI source type — a signer, not an AI marker. **Black Forest Labs (FLUX)** API output signs C2PA: `claim_generator_info "Black Forest Labs API"` + a `c2pa.ai_generated_content` assertion + `trainedAlgorithmicMedia` (issuer `b"Black Forest Labs"` added to `C2PA_ISSUERS`, platform "Black Forest Labs (FLUX)"). **ByteDance Volcano Engine (Volcengine)** — the cloud behind Doubao / Jimeng — signs its AI image output with a cert from `certificate_center@volcengine.com` + `trainedAlgorithmicMedia` (issuer `b"volcengine"` → "ByteDance (Volcano Engine)", platform "ByteDance (Doubao / Jimeng / Volcano Engine)"); note this is the C2PA-signed surface, distinct from the XMP/PNG TC260 `AIGC` label Doubao also uses. All three verified on real signed files 2026-05-29. - **EXIF/XMP generator tag (caught by `exif_generator`):** **Ideogram** writes EXIF `Make="Ideogram AI"` (collected 2026-05-24 — no C2PA, no SynthID, no imwatermark; the Make tag is the only signal). - **xAI / Grok — its own EXIF signature scheme, NOT C2PA (DETECTED by `metadata.xai_signature`, built 2026-05-26).** Grok JPEG downloads (Aurora model) carry **no C2PA, no XMP, no SynthID, no IPTC** — only EXIF `Artist` = a UUID and EXIF `ImageDescription` = `Signature: ` (a crypto signature, unverifiable locally without xAI's public key). This empirically kills the earlier unverified "xAI signs C2PA as xAI" lead — xAI is not even a C2PA member. `exif_generator` misses it (neither field holds an `AI_GENERATOR_TOKENS` token), so a dedicated detector `xai_signature(path)` matches the pair (`ImageDescription ~ ^Signature: [A-Za-z0-9+/=]{64,}` AND UUID `Artist`); wired into `has_ai_metadata`, `get_ai_metadata` (key `xai_signature`), and `identify` (signal `xai_signature`, platform "xAI (Grok / Aurora)"). **Format confirmed stable across n=3 genuine generations:** exactly three EXIF tags (`Artist`, `ExifOffset`, `ImageDescription`), `Signature:` prefix constant, base64 payload 300-1004 chars. Two capture facts: (a) the `Artist` UUID **equals the public image id** in the asset URL (`https://imagine-public.x.ai/imagine-public/images/.jpg`), so it is NOT a private per-user secret — only the `Signature` blob is; (b) the Grok web-UI image is a re-encoded **WebP with no signature** — the EXIF survives only in the *original* JPEG (download button or that public tokenless URL), which is why screenshots / re-encodes are metadata-stripped. A real fixture `data/samples/grok-1.jpg` plus **synthetic** JPEG fixtures (fake UUID + fake `Signature:` blob) cover the detector; never add a real Grok image carrying private content (the repo is public). **Stripped on removal too:** `remove_ai_metadata` now calls `_scrub_ai_exif` on the JPEG EXIF, which deletes the xAI Signature+UUID-Artist pair **and** any `Software`/`Make`/`Artist`/`ImageDescription` tag holding an `AI_GENERATOR_TOKENS` token (so Ideogram's `Make="Ideogram AI"` is scrubbed too), while keeping genuine camera/editor EXIF. The shared `_is_xai_signature_pair` helper (module-level compiled regexes) is the single source of truth for the pattern, used by both `xai_signature` and `_scrub_ai_exif`. (AVIF/HEIF/JXL still strip only C2PA boxes via `isobmff`, not EXIF — unchanged.) -- **China TC260 AIGC label (caught by `AIGC_MARKERS` / `metadata.aigc_label`, surfaced by `identify` as the `aigc` signal):** China-served generators embed an XMP `{"Label":"1","ContentProducer":...}` block — China's mandatory AI-content labeling (TC260 namespace `tc260.org.cn/ns/AIGC`). **Doubao** (ByteDance) uses it (verified on the real #13 sample 2026-05-25; `ContentProducer` `001191110102MACQD9K64010000`, no C2PA/SynthID/imwatermark — the XMP block is the only signal; GitHub attachment upload did NOT strip it). The same standard is mandatory for Jimeng/Kling/Qwen/Ernie etc., so the one marker covers the whole China-AIGC-labeled ecosystem. `aigc_label` reads **three serializations** through a shared `_parse` helper: the HTML-entity-encoded XMP `TC260:AIGC` block in **either RDF form** — the nested element `{...}` (Doubao) or the attribute `TC260:AIGC="{...}"` (**PicWish**, `ContentProducer="picwish"`, verified on the corpus 2026-05-30) — via a container-agnostic raw-byte scan (any JSON object accepted), a raw-JSON PNG `AIGC` tEXt chunk (Doubao also writes the label this way, no namespaced marker at all — confirmed on the corpus 2026-05-28, `ContentProducer="doubao"`), **and** a bare raw-JSON `{"AIGC":{...}}` object embedded in **JPEG EXIF (UserComment)** by some China-served generators, brace-matched from the scan head with `json.JSONDecoder().raw_decode` (no namespaced marker, no PNG chunk — confirmed on the corpus 2026-05-30, `ContentProducer="001191440300708461136T1308L"`). Both generic forms (the PNG chunk and the bare `{"AIGC":...}` object) are gated on at least one TC260 field (`_TC260_FIELDS`) so a generic `AIGC` key cannot false-positive; the namespaced XMP element is unambiguous and needs no gate. In `identify`, `aigc` fires on the parsed label **or** the `AIGC_MARKERS` byte scan (the latter preserves the laundering-tell case where the JSON payload is truncated). +- **China TC260 AIGC label (caught by `AIGC_MARKERS` / `metadata.aigc_label`, surfaced by `identify` as the `aigc` signal):** China-served generators embed an XMP `{"Label":"1","ContentProducer":...}` block — China's mandatory AI-content labeling (TC260 namespace `tc260.org.cn/ns/AIGC`). **Doubao** (ByteDance) uses it (verified on the real #13 sample 2026-05-25; `ContentProducer` `001191110102MACQD9K64010000`, no C2PA/SynthID/imwatermark — the XMP block is the only signal; GitHub attachment upload did NOT strip it). The same standard is mandatory for Jimeng/Kling/Qwen/Ernie etc., so the one marker covers the whole China-AIGC-labeled ecosystem. `aigc_label` reads **four serializations** through a shared `_parse` helper: the HTML-entity-encoded XMP `TC260:AIGC` block in **either RDF form** — the nested element `{...}` (Doubao) or the attribute `TC260:AIGC="{...}"` (**PicWish**, `ContentProducer="picwish"`, verified on the corpus 2026-05-30) — via a container-agnostic raw-byte scan (any JSON object accepted), a raw-JSON PNG `AIGC` tEXt chunk (Doubao also writes the label this way, no namespaced marker at all — confirmed on the corpus 2026-05-28, `ContentProducer="doubao"`), a bare raw-JSON `{"AIGC":{...}}` object embedded in **JPEG EXIF (UserComment)** by some China-served generators, brace-matched from the scan head with `json.JSONDecoder().raw_decode` (no namespaced marker, no PNG chunk — confirmed on the corpus 2026-05-30, `ContentProducer="001191440300708461136T1308L"`), **and** a bare `AIGC{...}` blob (the label glued straight to its JSON, no `"AIGC":` key wrapper) embedded in a **JPEG APP segment near the JFIF header** — confirmed on the corpus 2026-06-10 (`ContentProducer="00119144030008867405X210002"`; 3 files read `unknown` before this form was added). The two raw-JSON forms are scanned in one loop (`'"AIGC"'` then `AIGC{`) that **falls through on a non-TC260 / undecodable hit instead of returning** — a quoted `"AIGC"` can appear later in an XMP packet while the real label is a bare `AIGC{...}` earlier in the file, so an unconditional early return on the quoted form would shadow the bare form (the exact bug behind the 06-10 misses). All three generic forms (the PNG chunk, the bare `{"AIGC":...}` object, and the bare `AIGC{...}` blob) are gated on at least one TC260 field (`_TC260_FIELDS`) so a generic `AIGC` key cannot false-positive; the namespaced XMP element is unambiguous and needs no gate. In `identify`, `aigc` fires on the parsed label **or** the `AIGC_MARKERS` byte scan (the latter preserves the laundering-tell case where the JSON payload is truncated). - **HuggingFace-hosted job (caught by `metadata.huggingface_job`, surfaced by `identify` as the `hf_job` signal, MEDIUM confidence):** HuggingFace Jobs / Spaces stamp generated PNGs with an `hf-job-id` tEXt chunk holding the job UUID (3 on the corpus 2026-05-28, no other signal). It marks the *hosting job*, not a model — most commonly diffusion output — so it lifts an Unknown verdict to a tentative AI via `hf_only` (parallel to the visible sparkle) but never overrides a hard metadata signal; `_HF_JOB_CAVEAT` states the limit (job, not model; not proof of AI pixels). Stripped on removal (the PNG save whitelist keeps only `STANDARD_METADATA_KEYS`, so `hf-job-id` and the `AIGC` chunk are both dropped). The exact writer is not authoritatively documented (HF Jobs are generic GPU jobs), hence medium not high. - **No detectable signal on download (correctly reported `unknown`):** **Recraft** (PNG export is a re-encoded design export — strips everything), **Krea hosting FLUX 2** (no imwatermark despite FLUX — the host omits the encoder, same as Stability's hosted SDXL), and Midjourney (embeds nothing). Lesson: the imwatermark detector only fires on *pristine* output from a pipeline that runs the encoder (diffusers default, official BFL), not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva). - **Invisible but NOT locally detectable (proprietary, API/oracle only — same wall as SynthID):** Amazon Titan Image Generator + Nova Canvas (Bedrock `DetectGeneratedContent` API), Kakao (new SynthID image adopter, May 2026), NVIDIA Cosmos (SynthID video). No local detector possible; treat like SynthID. diff --git a/README.md b/README.md index e7ca20b..8369238 100644 --- a/README.md +++ b/README.md @@ -30,7 +30,7 @@ If this tool saves you time, consider [sponsoring its development](https://githu - **Text and face preservation (default)** — the default pipeline is a canny ControlNet that keeps text and face structure sharp through the removal pass (without copying original pixels, so SynthID is still removed). Use `--pipeline sdxl` for plain SDXL img2img (lighter, no extra model download) on inputs without text or faces. Canny preserves face *structure*, not *identity* (the regenerated face drifts in likeness). The library does not ship a face-restore extra: every approach evaluated (GFPGAN-on-cleaned, PhotoMaker-V2, InstantID txt2img, InstantID img2img-on-cleaned) regenerated the face via SDXL and made the output look more AI-generated than the cleaned image. The cleaned controlnet output is the least-AI face state achievable without re-introducing SynthID. - **Batch processing** — process entire directories - **Detection** — three-stage NCC watermark detection with confidence scoring -- **Provenance detection (`identify`)** — aggregate C2PA issuer, the C2PA soft-binding forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, embedded SD/ComfyUI params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the China TC260 AIGC label (XMP, PNG chunk, or EXIF), the HuggingFace `hf-job-id` job marker, the SynthID metadata proxy, the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" / Samsung Galaxy AI "Contenuti generati dall'AI" text marks), the open SD/SDXL/FLUX invisible watermark, and (with the `trustmark` extra) the open Adobe TrustMark watermark into one origin-platform + watermark-inventory verdict (`--json` for machine output) +- **Provenance detection (`identify`)** — aggregate C2PA issuer, the C2PA soft-binding forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, embedded SD/ComfyUI params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the China TC260 AIGC label (XMP, PNG chunk, EXIF, or JPEG segment), the HuggingFace `hf-job-id` job marker, the SynthID metadata proxy, the C2PA cloud-manifest reference (Adobe Durable Content Credentials, when the embedded manifest is stripped), the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" / Samsung Galaxy AI "Contenuti generati dall'AI" text marks), the open SD/SDXL/FLUX invisible watermark, and (with the `trustmark` extra) the open Adobe TrustMark watermark into one origin-platform + watermark-inventory verdict (`--json` for machine output) ## Examples @@ -62,7 +62,7 @@ If this tool saves you time, consider [sponsoring its development](https://githu > Visible overlays are used by Google Gemini / Nano Banana (sparkle logo), by ByteDance's Doubao ("豆包AI生成" corner text) and Jimeng / Dreamina ("★ 即梦AI" wordmark), and by Samsung Galaxy AI ("✦ Contenuti generati dall'AI" strip, bottom-left, locale-specific). All are removed on CPU by reverse-alpha against a captured alpha map (Jimeng and Samsung add a thin residual inpaint over the glyph footprint, since their marks re-rasterize per image). Other services rely on invisible watermarks and/or metadata; our diffusion-based regeneration works against any invisible watermark in pixel or frequency domain. For a visible mark from any other source (any position, any colour), use the universal `erase --region` command. -> **Detection:** `remove-ai-watermarks identify ` reports the origin platform and watermark inventory for all the signals above — C2PA issuer, the C2PA soft-binding forensic-watermark vendor (TrustMark / Digimarc / Imatag / ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, the China TC260 AIGC label (XMP, PNG chunk, or EXIF), the HuggingFace `hf-job-id` job marker, embedded generation params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the SynthID metadata proxy, the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" / Samsung Galaxy AI "Contenuti generati dall'AI" text marks), and (with the `[detect]` / `[trustmark]` extras) the open SD/SDXL/FLUX and Adobe TrustMark invisible watermarks. SynthID and the proprietary soft-binding watermarks (Digimarc etc.) have no local decoder, so they are reported by metadata proxy / vendor name only. +> **Detection:** `remove-ai-watermarks identify ` reports the origin platform and watermark inventory for all the signals above — C2PA issuer, the C2PA soft-binding forensic-watermark vendor (TrustMark / Digimarc / Imatag / ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, the China TC260 AIGC label (XMP, PNG chunk, EXIF, or JPEG segment), the HuggingFace `hf-job-id` job marker, embedded generation params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the SynthID metadata proxy, the C2PA cloud-manifest reference (Adobe Durable Content Credentials, when the embedded manifest is stripped), the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" / Samsung Galaxy AI "Contenuti generati dall'AI" text marks), and (with the `[detect]` / `[trustmark]` extras) the open SD/SDXL/FLUX and Adobe TrustMark invisible watermarks. SynthID and the proprietary soft-binding watermarks (Digimarc etc.) have no local decoder, so they are reported by metadata proxy / vendor name only. ## How it works diff --git a/src/remove_ai_watermarks/identify.py b/src/remove_ai_watermarks/identify.py index 419ec6f..bade54e 100644 --- a/src/remove_ai_watermarks/identify.py +++ b/src/remove_ai_watermarks/identify.py @@ -30,6 +30,7 @@ from remove_ai_watermarks.metadata import ( IPTC_AI_FIELD_MARKERS, IPTC_AI_MARKERS, aigc_label, + c2pa_cloud_manifest_in, c2pa_marker_in, exif_generator, get_ai_metadata, @@ -96,6 +97,13 @@ _HF_JOB_CAVEAT = ( "generation) but names neither the model nor the content type, so it is a " "medium-confidence signal, not proof the pixels are AI-generated." ) +_C2PA_CLOUD_CAVEAT = ( + "The embedded C2PA manifest is absent but an XMP provenance pointer to the " + "vendor's cloud manifest store survives, so the Content Credentials remain " + "recoverable server-side -- stripping the file no longer removes the provenance. " + "It marks Content Credentials, not AI origin: the cloud manifest may describe a " + "human edit, and reading it needs a network fetch this tool does not make." +) _SAMSUNG_GENAI_CAVEAT = ( "Samsung's genAIType marker shows a Galaxy AI editing tool (Generative Edit, " "Sketch to Image, ...) touched the image; it is an undocumented proprietary " @@ -285,7 +293,12 @@ def _vendor_of(text: str | None) -> str | None: # chain like Adobe over a Gemini original) legitimately names several vendors in # one valid chain and must not read as spoofing. Families not listed here are each # their own independent source (EXIF/XMP generator, IPTC AISystemUsed, AIGC, ...). -_CLASH_SOURCE: dict[str, str] = {"c2pa": "c2pa_manifest", "synthid": "c2pa_manifest"} +# The single C2PA-manifest source shared by the issuer attribution and the SynthID +# proxy (both inferred from the same embedded manifest). Rule 2 keys off it too: +# the camera device label is read from this manifest, so an AI marker is a clash +# only when its source differs from this (i.e. it is genuinely independent). +_C2PA_MANIFEST_SOURCE = "c2pa_manifest" +_CLASH_SOURCE: dict[str, str] = {"c2pa": _C2PA_MANIFEST_SOURCE, "synthid": _C2PA_MANIFEST_SOURCE} def _integrity_clashes( @@ -326,7 +339,16 @@ def _integrity_clashes( + " -- one provenance set was likely spoofed, transplanted, or laundered." ) - if camera_label and camera_has_ai_marker: + # Rule 2: a camera-capture C2PA device next to an AI-generation marker. Only + # an AI marker from a source INDEPENDENT of the camera's own C2PA manifest is + # a contradiction. A device that both captures and runs on-device generative + # AI (Google Pixel Magic Editor / Pixel Studio) records the capture and the + # AI edit in ONE manifest, so the AI vendor is named only from that same + # manifest (c2pa issuer + synthid proxy) -- a legitimate edit chain, not a + # spoof. An EXIF/XMP generator, IPTC field, TC260 AIGC label, or second + # manifest naming AI on a camera capture is the real laundering tell. + independent_ai_marker = any(grp != _C2PA_MANIFEST_SOURCE for grp in source.values()) + if camera_label and camera_has_ai_marker and independent_ai_marker: vendors = ", ".join(sorted(set(ai_vendors.values()))) or "present" clashes.append( f"Camera-capture C2PA credentials ({camera_label}) coexist with AI-generation markers " @@ -483,6 +505,21 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b if c2pa_is_ai and (v := (_vendor_of(_attribute_platform(issuers, is_ai=True)) or _vendor_of(generator))): ai_vendor_claims["c2pa"] = v + # ── C2PA cloud-manifest reference (Durable Content Credentials) ─ + # An XMP dcterms:provenance pointer to a vendor manifest store survives even + # when the embedded manifest is stripped, so the credentials stay recoverable + # server-side (C2PA 2.4). Provenance only -- it does NOT assert AI (the cloud + # manifest may describe a human edit), so it is excluded from ai_from_metadata + # and the clash vendors. Skip when an embedded manifest already attributed it. + if not has_c2pa and (cloud_vendor := c2pa_cloud_manifest_in(head)): + signals.append(Signal("c2pa_cloud", f"cloud manifest store: {cloud_vendor}", "medium")) + watermarks.append( + f"C2PA Durable Content Credentials (cloud manifest at {cloud_vendor}; embedded manifest absent)" + ) + caveats.append(_C2PA_CLOUD_CAVEAT) + if platform is None: + platform = f"C2PA signer: {cloud_vendor} (cloud manifest)" + # ── SynthID metadata proxy ────────────────────────────────────── # get_ai_metadata already sets synthid_watermark for both PNG (caBX parser) # and non-PNG (its own synthid_source fallback), so no extra scan is needed. diff --git a/src/remove_ai_watermarks/metadata.py b/src/remove_ai_watermarks/metadata.py index a91d691..e429c3e 100644 --- a/src/remove_ai_watermarks/metadata.py +++ b/src/remove_ai_watermarks/metadata.py @@ -343,13 +343,16 @@ def aigc_label(image_path: Path) -> dict[str, str] | None: found by a container-agnostic raw-byte scan (PNG/JPEG/WebP alike); and - a raw-JSON ``{"AIGC":{...}}`` block with no namespace, as embedded in JPEG EXIF (UserComment) by some China-served generators, brace-matched from the - scan head. + scan head; and + - a bare ``AIGC{...}`` blob (the label glued straight to its JSON, no + ``"AIGC":`` key wrapper) embedded in a JPEG APP segment near the JFIF + header by some China-served generators. Returns the decoded JSON (e.g. ``{"Label": "1", "ContentProducer": ...}``) - or None. The generic forms (the PNG-chunk key ``AIGC`` and the bare - ``{"AIGC":...}`` object) are accepted only if they carry at least one known - TC260 field (``_TC260_FIELDS``); the namespaced XMP element is unambiguous, - so any JSON object is accepted. + or None. The generic forms (the PNG-chunk key ``AIGC``, the bare + ``{"AIGC":...}`` object, and the bare ``AIGC{...}`` blob) are accepted only + if they carry at least one known TC260 field (``_TC260_FIELDS``); the + namespaced XMP element is unambiguous, so any JSON object is accepted. """ import html import json @@ -393,24 +396,76 @@ def aigc_label(image_path: Path) -> dict[str, str] | None: body = match.group(1) if match.group(1) is not None else match.group(2) return _parse(html.unescape(body.decode("utf-8", "replace")), require_tc260_field=False) - # Raw-JSON {"AIGC":{...}} block (no namespace), as written into JPEG EXIF - # (UserComment) by some China-served generators -- the PNG-chunk and XMP - # paths above both miss it. The bytes pre-check keeps the common (no-AIGC) - # path off the full-buffer decode; raw_decode then brace-matches the inner - # object (respecting nested braces / quoted strings) and `_parse` applies the - # same dict coercion + TC260-field gate as the generic PNG-chunk path. - if b'"AIGC"' in data: - text = data.decode("latin-1") - brace = text.find("{", text.find('"AIGC"') + len('"AIGC"')) - if brace != -1: - try: - _, end = json.JSONDecoder().raw_decode(text, brace) - except ValueError: - return None - return _parse(text[brace:end], require_tc260_field=True) + # Generic raw-JSON forms the PNG-chunk and XMP paths above both miss, each + # gated on a TC260 field: the ``"AIGC":{...}`` key wrapper (as written into + # JPEG EXIF UserComment) and the bare ``AIGC{...}`` blob (the label glued + # straight to its JSON, no key wrapper, in a JPEG APP segment near the JFIF + # header). `raw_decode` brace-matches the inner object (respecting nested + # braces / quoted strings); `_parse` applies the same dict coercion + TC260 + # gate as the PNG-chunk path. A non-matching hit (no TC260 field, or an + # undecodable brace) must FALL THROUGH to the next form, never short-circuit: + # a quoted ``"AIGC"`` can appear later in an XMP packet while the real label + # is a bare ``AIGC{...}`` blob earlier in the file, so an unconditional return + # on the quoted form would shadow the bare form. + text = data.decode("latin-1") + for needle in ('"AIGC"', "AIGC{"): + start = text.find(needle) + if start == -1: + continue + # First brace at/after the needle: the object brace for ``"AIGC":{`` and + # the glued brace (at start+4) for the bare ``AIGC{`` -- one search covers both. + brace = text.find("{", start) + if brace == -1: + continue + try: + _, end = json.JSONDecoder().raw_decode(text, brace) + except ValueError: + continue + if result := _parse(text[brace:end], require_tc260_field=True): + return result return None +# C2PA "Durable Content Credentials" manifest repositories (C2PA 2.4). When the +# embedded manifest is stripped, an XMP ``dcterms:provenance`` URL can still point +# at the vendor's cloud manifest store, from which the credentials are recoverable +# server-side via the file's soft binding. Host -> vendor label. Verified on real +# files: Adobe's Content Authenticity cloud store. +_C2PA_MANIFEST_REPOSITORIES: tuple[tuple[bytes, str], ...] = ( + (b"cai-manifests.adobe.com", "Adobe Content Authenticity"), +) + + +def c2pa_cloud_manifest_in(data: bytes) -> str | None: + """Return a C2PA cloud-manifest vendor label if ``data`` carries an XMP + ``dcterms:provenance`` pointer to a known manifest repository, else None. + + The shared byte-scan (mirroring ``soft_binding_vendors_in``), so a caller that + already holds the scan head (``identify``) reuses it instead of re-reading. + """ + if b"dcterms:provenance" not in data: + return None + for host, vendor in _C2PA_MANIFEST_REPOSITORIES: + if host in data: + return vendor + return None + + +def c2pa_cloud_manifest(image_path: Path) -> str | None: + """Return a C2PA cloud-manifest vendor label if the file carries only an XMP + ``dcterms:provenance`` pointer to a manifest repository (C2PA 2.4 Durable + Content Credentials), else None. + + This fires on the laundering case where the *embedded* manifest was stripped + but the XMP cloud reference survives, so the Content Credentials remain + recoverable server-side. It is provenance, NOT an AI assertion: the cloud + manifest can describe a human edit as easily as an AI generation, and reading + its contents needs a network fetch we do not do. ``identify`` surfaces it as a + provenance signal without setting ``is_ai_generated``. + """ + return c2pa_cloud_manifest_in(scan_head(image_path, _QUICK_SCAN_BYTES)) + + def huggingface_job(image_path: Path) -> str | None: """Return the HuggingFace job id if the image carries an ``hf-job-id`` PNG text chunk, else None. diff --git a/tests/test_identify.py b/tests/test_identify.py index 47f8103..6d57689 100644 --- a/tests/test_identify.py +++ b/tests/test_identify.py @@ -772,6 +772,31 @@ class TestIntegrityClashesHelper: # must NOT raise a clash. assert _integrity_clashes({}, "Leica (camera, C2PA capture)", camera_has_ai_marker=False) == [] + def test_pixel_generative_edit_same_manifest_no_clash(self): + # A Google Pixel that BOTH captures and runs on-device generative AI + # (Magic Editor / Pixel Studio) records the capture and the AI edit in + # ONE C2PA manifest -- the AI vendor is named only from that same + # manifest (c2pa / synthid), independent of nothing. That is a legitimate + # edit chain, NOT a camera-vs-AI contradiction, so rule 2 must stay quiet. + assert ( + _integrity_clashes( + {"c2pa": "Google", "synthid": "Google"}, + "Google Pixel (camera, C2PA capture)", + camera_has_ai_marker=True, + ) + == [] + ) + + def test_camera_plus_independent_ai_marker_still_clashes(self): + # But a camera capture next to an AI marker from a genuinely INDEPENDENT + # source (EXIF/XMP generator, TC260 AIGC, ...) is still a laundering tell. + clashes = _integrity_clashes( + {"c2pa": "Google", "aigc": "China AIGC (TC260)"}, + "Google Pixel (camera, C2PA capture)", + camera_has_ai_marker=True, + ) + assert any("Camera-capture" in c for c in clashes) + class TestIntegrityClashEndToEnd: def _c2pa_jpeg(self, tmp_path: Path, blob: bytes) -> Path: @@ -806,6 +831,22 @@ class TestIntegrityClashEndToEnd: assert r.platform == "Google Pixel (camera, C2PA capture)" assert any("Camera-capture C2PA credentials" in c and "AI-generation markers" in c for c in r.integrity_clashes) + def test_pixel_generative_edit_no_clash(self, tmp_path: Path): + # A real Google Pixel generative edit (Magic Editor / Pixel Studio) signs + # ONE manifest carrying both the Pixel Camera capture and a Google + # Generative AI edit (trainedAlgorithmicMedia + "Applied imperceptible + # SynthID watermark"). The AI marker lives in the SAME manifest as the + # device, so it is an edit chain, not a camera-vs-AI contradiction. + path = self._c2pa_jpeg( + tmp_path, + b"Pixel Camera ... Created by Pixel Camera ... computationalCapture ... " + b"Created by Google Generative AI ... trainedAlgorithmicMedia ... " + b"Applied imperceptible SynthID watermark", + ) + r = identify(path, check_visible=False, check_invisible=False) + assert r.is_ai_generated is True + assert r.integrity_clashes == [] + def test_clash_serializes_to_json(self, tmp_path: Path): path = self._c2pa_jpeg(tmp_path, b"OpenAI ... trainedAlgorithmicMedia ... TC260:AIGC label") r = identify(path, check_visible=False, check_invisible=False) diff --git a/tests/test_metadata.py b/tests/test_metadata.py index bfc146e..623812c 100644 --- a/tests/test_metadata.py +++ b/tests/test_metadata.py @@ -790,6 +790,42 @@ class TestAIGCLabel: def test_has_ai_metadata_detects_raw_json_exif_form(self, tmp_path: Path): assert has_ai_metadata(self._aigc_exif_jpeg(tmp_path)) + def _aigc_bare_jpeg(self, tmp_path: Path, producer: str = "00119144030008867405X210002") -> Path: + """Some China-served generators glue the TC260 label straight to its JSON + as a bare ``AIGC{...}`` blob inside a JPEG APP segment (no ``"AIGC":`` + key wrapper, no PNG chunk, no namespaced XMP) -- seen near the JFIF + header on real 2026-06 downloads.""" + p = tmp_path / "aigc_bare.jpg" + Image.new("RGB", (32, 32)).save(p) + raw = p.read_bytes() + blob = b'AIGC{"Label":"1","ContentProducer":"' + producer.encode() + b'","ProduceID":"8F995586"}' + segment = b"\xff\xe9" + (len(blob) + 2).to_bytes(2, "big") + blob # APP9 + p.write_bytes(raw[:2] + segment + raw[2:]) # splice after SOI + return p + + def test_parses_bare_aigc_jpeg_segment_form(self, tmp_path: Path): + from remove_ai_watermarks.metadata import aigc_label + + info = aigc_label(self._aigc_bare_jpeg(tmp_path)) + assert info is not None + assert info["Label"] == "1" + assert info["ContentProducer"] == "00119144030008867405X210002" + + def test_has_ai_metadata_detects_bare_aigc_jpeg_form(self, tmp_path: Path): + assert has_ai_metadata(self._aigc_bare_jpeg(tmp_path)) + + def test_bare_aigc_without_tc260_field_ignored(self, tmp_path: Path): + """A bare ``AIGC{...}`` blob with no TC260 field must not false-positive.""" + from remove_ai_watermarks.metadata import aigc_label + + p = tmp_path / "bare_unrelated.jpg" + Image.new("RGB", (32, 32)).save(p) + raw = p.read_bytes() + blob = b'AIGC{"unrelated":"value"}' + segment = b"\xff\xe9" + (len(blob) + 2).to_bytes(2, "big") + blob + p.write_bytes(raw[:2] + segment + raw[2:]) + assert aigc_label(p) is None + def test_raw_json_without_tc260_field_ignored(self, tmp_path: Path): """A bare ``{"AIGC":{...}}`` object with no TC260 field must not fire.""" import json @@ -1185,3 +1221,49 @@ class TestFfmpegMetadataStrip: remove_ai_metadata(src, out) assert out.exists() assert b"Suno AI generated" not in out.read_bytes() # tag stripped, audio kept + + +class TestC2paCloudManifest: + """C2PA 2.4 Durable Content Credentials: an XMP dcterms:provenance pointer to + a vendor cloud manifest store survives when the embedded manifest is stripped.""" + + def _cloud_png(self, tmp_path: Path, host: bytes = b"cai-manifests.adobe.com") -> Path: + xmp = ( + b'' + b'' + b' ' + b'' + ) + p = tmp_path / "cloud.png" + img = Image.new("RGB", (16, 16)) + meta = PngInfo() + meta.add_itxt("XML:com.adobe.xmp", xmp.decode("latin-1")) + img.save(p, pnginfo=meta) + return p + + def test_detects_adobe_cloud_manifest(self, tmp_path: Path): + from remove_ai_watermarks.metadata import c2pa_cloud_manifest + + assert c2pa_cloud_manifest(self._cloud_png(tmp_path)) == "Adobe Content Authenticity" + + def test_no_provenance_pointer_is_none(self, tmp_clean_png: Path): + from remove_ai_watermarks.metadata import c2pa_cloud_manifest + + assert c2pa_cloud_manifest(tmp_clean_png) is None + + def test_unknown_host_is_none(self, tmp_path: Path): + from remove_ai_watermarks.metadata import c2pa_cloud_manifest + + # A dcterms:provenance pointer to an unrecognized host is not attributed. + assert c2pa_cloud_manifest(self._cloud_png(tmp_path, host=b"manifests.example.com")) is None + + def test_cloud_manifest_does_not_assert_ai(self, tmp_path: Path): + # Provenance only -- a cloud manifest can describe a human edit, so the + # verdict must stay 'unknown', not 'AI-generated'. + from remove_ai_watermarks.identify import identify + + r = identify(self._cloud_png(tmp_path), check_visible=False, check_invisible=False) + assert r.is_ai_generated is None + assert any("Durable Content Credentials" in w for w in r.watermarks) + assert any(s.name == "c2pa_cloud" for s in r.signals)