feat(identify): close 3 detector gaps found on the spaces corpus (06-05..06-11)

- AIGC: parse the bare ``AIGC{...}`` blob form (label glued to its JSON in a JPEG APP segment near the JFIF header), and scan both raw-JSON forms in one fall-through loop so a quoted ``"AIGC"`` later in an XMP packet no longer shadows a real bare label earlier in the file (3 files read unknown before).
- Integrity clash rule 2: a camera device + an AI marker from the SAME C2PA manifest (Google Pixel Magic Editor / Pixel Studio edit chain) is a legitimate edit chain, not a contradiction. Fire only when the AI marker's source is independent of the camera's manifest; pure cameras (Leica/Sony/Nikon) are unaffected (2 Pixel files mis-flagged before).
- New c2pa_cloud_manifest detector: surface a C2PA 2.4 Durable Content Credentials cloud-manifest reference (Adobe cai-manifests.adobe.com) as a medium provenance signal when the embedded manifest is stripped. Provenance only, never asserts is_ai (2 files read fully unknown before).

identify reuses its already-loaded scan head for the cloud check (no second read).
+7 tests; CLAUDE.md + README synced.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-05 07:57:50 +02:00 · 2026-06-11 09:28:15 -07:00
parent 22bc171806
commit ad7e4ee08b
6 changed files with 241 additions and 26 deletions
@@ -30,7 +30,7 @@ If this tool saves you time, consider [sponsoring its development](https://githu
 - **Text and face preservation (default)** — the default pipeline is a canny ControlNet that keeps text and face structure sharp through the removal pass (without copying original pixels, so SynthID is still removed). Use `--pipeline sdxl` for plain SDXL img2img (lighter, no extra model download) on inputs without text or faces. Canny preserves face *structure*, not *identity* (the regenerated face drifts in likeness). The library does not ship a face-restore extra: every approach evaluated (GFPGAN-on-cleaned, PhotoMaker-V2, InstantID txt2img, InstantID img2img-on-cleaned) regenerated the face via SDXL and made the output look more AI-generated than the cleaned image. The cleaned controlnet output is the least-AI face state achievable without re-introducing SynthID.
 - **Batch processing** — process entire directories
 - **Detection** — three-stage NCC watermark detection with confidence scoring
- **Provenance detection (`identify`)** — aggregate C2PA issuer, the C2PA soft-binding forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, embedded SD/ComfyUI params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the China TC260 AIGC label (XMP, PNG chunk, or EXIF), the HuggingFace `hf-job-id` job marker, the SynthID metadata proxy, the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" / Samsung Galaxy AI "Contenuti generati dall'AI" text marks), the open SD/SDXL/FLUX invisible watermark, and (with the `trustmark` extra) the open Adobe TrustMark watermark into one origin-platform + watermark-inventory verdict (`--json` for machine output)
+- **Provenance detection (`identify`)** — aggregate C2PA issuer, the C2PA soft-binding forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, embedded SD/ComfyUI params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the China TC260 AIGC label (XMP, PNG chunk, EXIF, or JPEG segment), the HuggingFace `hf-job-id` job marker, the SynthID metadata proxy, the C2PA cloud-manifest reference (Adobe Durable Content Credentials, when the embedded manifest is stripped), the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" / Samsung Galaxy AI "Contenuti generati dall'AI" text marks), the open SD/SDXL/FLUX invisible watermark, and (with the `trustmark` extra) the open Adobe TrustMark watermark into one origin-platform + watermark-inventory verdict (`--json` for machine output)

 ## Examples

@@ -62,7 +62,7 @@ If this tool saves you time, consider [sponsoring its development](https://githu

 > Visible overlays are used by Google Gemini / Nano Banana (sparkle logo), by ByteDance's Doubao ("豆包AI生成" corner text) and Jimeng / Dreamina ("★ 即梦AI" wordmark), and by Samsung Galaxy AI ("✦ Contenuti generati dall'AI" strip, bottom-left, locale-specific). All are removed on CPU by reverse-alpha against a captured alpha map (Jimeng and Samsung add a thin residual inpaint over the glyph footprint, since their marks re-rasterize per image). Other services rely on invisible watermarks and/or metadata; our diffusion-based regeneration works against any invisible watermark in pixel or frequency domain. For a visible mark from any other source (any position, any colour), use the universal `erase --region` command.

-> **Detection:** `remove-ai-watermarks identify <image>` reports the origin platform and watermark inventory for all the signals above — C2PA issuer, the C2PA soft-binding forensic-watermark vendor (TrustMark / Digimarc / Imatag / ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, the China TC260 AIGC label (XMP, PNG chunk, or EXIF), the HuggingFace `hf-job-id` job marker, embedded generation params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the SynthID metadata proxy, the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" / Samsung Galaxy AI "Contenuti generati dall'AI" text marks), and (with the `[detect]` / `[trustmark]` extras) the open SD/SDXL/FLUX and Adobe TrustMark invisible watermarks. SynthID and the proprietary soft-binding watermarks (Digimarc etc.) have no local decoder, so they are reported by metadata proxy / vendor name only.
+> **Detection:** `remove-ai-watermarks identify <image>` reports the origin platform and watermark inventory for all the signals above — C2PA issuer, the C2PA soft-binding forensic-watermark vendor (TrustMark / Digimarc / Imatag / ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, the China TC260 AIGC label (XMP, PNG chunk, EXIF, or JPEG segment), the HuggingFace `hf-job-id` job marker, embedded generation params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the SynthID metadata proxy, the C2PA cloud-manifest reference (Adobe Durable Content Credentials, when the embedded manifest is stripped), the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" / Samsung Galaxy AI "Contenuti generati dall'AI" text marks), and (with the `[detect]` / `[trustmark]` extras) the open SD/SDXL/FLUX and Adobe TrustMark invisible watermarks. SynthID and the proprietary soft-binding watermarks (Digimarc etc.) have no local decoder, so they are reported by metadata proxy / vendor name only.

 ## How it works

@@ -30,6 +30,7 @@ from remove_ai_watermarks.metadata import (
    IPTC_AI_FIELD_MARKERS,
    IPTC_AI_MARKERS,
    aigc_label,
+    c2pa_cloud_manifest_in,
    c2pa_marker_in,
    exif_generator,
    get_ai_metadata,
@@ -96,6 +97,13 @@ _HF_JOB_CAVEAT = (
    "generation) but names neither the model nor the content type, so it is a "
    "medium-confidence signal, not proof the pixels are AI-generated."
 )
+_C2PA_CLOUD_CAVEAT = (
+    "The embedded C2PA manifest is absent but an XMP provenance pointer to the "
+    "vendor's cloud manifest store survives, so the Content Credentials remain "
+    "recoverable server-side -- stripping the file no longer removes the provenance. "
+    "It marks Content Credentials, not AI origin: the cloud manifest may describe a "
+    "human edit, and reading it needs a network fetch this tool does not make."
+)
 _SAMSUNG_GENAI_CAVEAT = (
    "Samsung's genAIType marker shows a Galaxy AI editing tool (Generative Edit, "
    "Sketch to Image, ...) touched the image; it is an undocumented proprietary "
@@ -285,7 +293,12 @@ def _vendor_of(text: str | None) -> str | None:
 # chain like Adobe over a Gemini original) legitimately names several vendors in
 # one valid chain and must not read as spoofing. Families not listed here are each
 # their own independent source (EXIF/XMP generator, IPTC AISystemUsed, AIGC, ...).
-_CLASH_SOURCE: dict[str, str] = {"c2pa": "c2pa_manifest", "synthid": "c2pa_manifest"}
+# The single C2PA-manifest source shared by the issuer attribution and the SynthID
+# proxy (both inferred from the same embedded manifest). Rule 2 keys off it too:
+# the camera device label is read from this manifest, so an AI marker is a clash
+# only when its source differs from this (i.e. it is genuinely independent).
+_C2PA_MANIFEST_SOURCE = "c2pa_manifest"
+_CLASH_SOURCE: dict[str, str] = {"c2pa": _C2PA_MANIFEST_SOURCE, "synthid": _C2PA_MANIFEST_SOURCE}


 def _integrity_clashes(
@@ -326,7 +339,16 @@ def _integrity_clashes(
            + " -- one provenance set was likely spoofed, transplanted, or laundered."
        )

-    if camera_label and camera_has_ai_marker:
+    # Rule 2: a camera-capture C2PA device next to an AI-generation marker. Only
+    # an AI marker from a source INDEPENDENT of the camera's own C2PA manifest is
+    # a contradiction. A device that both captures and runs on-device generative
+    # AI (Google Pixel Magic Editor / Pixel Studio) records the capture and the
+    # AI edit in ONE manifest, so the AI vendor is named only from that same
+    # manifest (c2pa issuer + synthid proxy) -- a legitimate edit chain, not a
+    # spoof. An EXIF/XMP generator, IPTC field, TC260 AIGC label, or second
+    # manifest naming AI on a camera capture is the real laundering tell.
+    independent_ai_marker = any(grp != _C2PA_MANIFEST_SOURCE for grp in source.values())
+    if camera_label and camera_has_ai_marker and independent_ai_marker:
        vendors = ", ".join(sorted(set(ai_vendors.values()))) or "present"
        clashes.append(
            f"Camera-capture C2PA credentials ({camera_label}) coexist with AI-generation markers "
@@ -483,6 +505,21 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
        if c2pa_is_ai and (v := (_vendor_of(_attribute_platform(issuers, is_ai=True)) or _vendor_of(generator))):
            ai_vendor_claims["c2pa"] = v

+    # ── C2PA cloud-manifest reference (Durable Content Credentials) ─
+    # An XMP dcterms:provenance pointer to a vendor manifest store survives even
+    # when the embedded manifest is stripped, so the credentials stay recoverable
+    # server-side (C2PA 2.4). Provenance only -- it does NOT assert AI (the cloud
+    # manifest may describe a human edit), so it is excluded from ai_from_metadata
+    # and the clash vendors. Skip when an embedded manifest already attributed it.
+    if not has_c2pa and (cloud_vendor := c2pa_cloud_manifest_in(head)):
+        signals.append(Signal("c2pa_cloud", f"cloud manifest store: {cloud_vendor}", "medium"))
+        watermarks.append(
+            f"C2PA Durable Content Credentials (cloud manifest at {cloud_vendor}; embedded manifest absent)"
+        )
+        caveats.append(_C2PA_CLOUD_CAVEAT)
+        if platform is None:
+            platform = f"C2PA signer: {cloud_vendor} (cloud manifest)"
+
    # ── SynthID metadata proxy ──────────────────────────────────────
    # get_ai_metadata already sets synthid_watermark for both PNG (caBX parser)
    # and non-PNG (its own synthid_source fallback), so no extra scan is needed.
@@ -343,13 +343,16 @@ def aigc_label(image_path: Path) -> dict[str, str] | None:
      found by a container-agnostic raw-byte scan (PNG/JPEG/WebP alike); and
    - a raw-JSON ``{"AIGC":{...}}`` block with no namespace, as embedded in JPEG
      EXIF (UserComment) by some China-served generators, brace-matched from the
-      scan head.
+      scan head; and
+    - a bare ``AIGC{...}`` blob (the label glued straight to its JSON, no
+      ``"AIGC":`` key wrapper) embedded in a JPEG APP segment near the JFIF
+      header by some China-served generators.

    Returns the decoded JSON (e.g. ``{"Label": "1", "ContentProducer": ...}``)
-    or None. The generic forms (the PNG-chunk key ``AIGC`` and the bare
-    ``{"AIGC":...}`` object) are accepted only if they carry at least one known
-    TC260 field (``_TC260_FIELDS``); the namespaced XMP element is unambiguous,
-    so any JSON object is accepted.
+    or None. The generic forms (the PNG-chunk key ``AIGC``, the bare
+    ``{"AIGC":...}`` object, and the bare ``AIGC{...}`` blob) are accepted only
+    if they carry at least one known TC260 field (``_TC260_FIELDS``); the
+    namespaced XMP element is unambiguous, so any JSON object is accepted.
    """
    import html
    import json
@@ -393,24 +396,76 @@ def aigc_label(image_path: Path) -> dict[str, str] | None:
        body = match.group(1) if match.group(1) is not None else match.group(2)
        return _parse(html.unescape(body.decode("utf-8", "replace")), require_tc260_field=False)

-    # Raw-JSON {"AIGC":{...}} block (no namespace), as written into JPEG EXIF
-    # (UserComment) by some China-served generators -- the PNG-chunk and XMP
-    # paths above both miss it. The bytes pre-check keeps the common (no-AIGC)
-    # path off the full-buffer decode; raw_decode then brace-matches the inner
-    # object (respecting nested braces / quoted strings) and `_parse` applies the
-    # same dict coercion + TC260-field gate as the generic PNG-chunk path.
-    if b'"AIGC"' in data:
-        text = data.decode("latin-1")
-        brace = text.find("{", text.find('"AIGC"') + len('"AIGC"'))
-        if brace != -1:
-            try:
-                _, end = json.JSONDecoder().raw_decode(text, brace)
-            except ValueError:
-                return None
-            return _parse(text[brace:end], require_tc260_field=True)
+    # Generic raw-JSON forms the PNG-chunk and XMP paths above both miss, each
+    # gated on a TC260 field: the ``"AIGC":{...}`` key wrapper (as written into
+    # JPEG EXIF UserComment) and the bare ``AIGC{...}`` blob (the label glued
+    # straight to its JSON, no key wrapper, in a JPEG APP segment near the JFIF
+    # header). `raw_decode` brace-matches the inner object (respecting nested
+    # braces / quoted strings); `_parse` applies the same dict coercion + TC260
+    # gate as the PNG-chunk path. A non-matching hit (no TC260 field, or an
+    # undecodable brace) must FALL THROUGH to the next form, never short-circuit:
+    # a quoted ``"AIGC"`` can appear later in an XMP packet while the real label
+    # is a bare ``AIGC{...}`` blob earlier in the file, so an unconditional return
+    # on the quoted form would shadow the bare form.
+    text = data.decode("latin-1")
+    for needle in ('"AIGC"', "AIGC{"):
+        start = text.find(needle)
+        if start == -1:
+            continue
+        # First brace at/after the needle: the object brace for ``"AIGC":{`` and
+        # the glued brace (at start+4) for the bare ``AIGC{`` -- one search covers both.
+        brace = text.find("{", start)
+        if brace == -1:
+            continue
+        try:
+            _, end = json.JSONDecoder().raw_decode(text, brace)
+        except ValueError:
+            continue
+        if result := _parse(text[brace:end], require_tc260_field=True):
+            return result
    return None


+# C2PA "Durable Content Credentials" manifest repositories (C2PA 2.4). When the
+# embedded manifest is stripped, an XMP ``dcterms:provenance`` URL can still point
+# at the vendor's cloud manifest store, from which the credentials are recoverable
+# server-side via the file's soft binding. Host -> vendor label. Verified on real
+# files: Adobe's Content Authenticity cloud store.
+_C2PA_MANIFEST_REPOSITORIES: tuple[tuple[bytes, str], ...] = (
+    (b"cai-manifests.adobe.com", "Adobe Content Authenticity"),
+)
+
+
+def c2pa_cloud_manifest_in(data: bytes) -> str | None:
+    """Return a C2PA cloud-manifest vendor label if ``data`` carries an XMP
+    ``dcterms:provenance`` pointer to a known manifest repository, else None.
+
+    This is the shared byte-scan (mirroring ``soft_binding_vendors_in``), so a caller
+    that already holds the scan head (``identify``) reuses it instead of re-reading.
+    """
+    if b"dcterms:provenance" not in data:
+        return None
+    for host, vendor in _C2PA_MANIFEST_REPOSITORIES:
+        if host in data:
+            return vendor
+    return None
+
+
+def c2pa_cloud_manifest(image_path: Path) -> str | None:
+    """Return a C2PA cloud-manifest vendor label if the file carries only an XMP
+    ``dcterms:provenance`` pointer to a manifest repository (C2PA 2.4 Durable
+    Content Credentials), else None.
+
+    This fires on the laundering case where the *embedded* manifest was stripped
+    but the XMP cloud reference survives, so the Content Credentials remain
+    recoverable server-side. It is provenance, NOT an AI assertion: the cloud
+    manifest can describe a human edit as easily as an AI generation, and reading
+    its contents needs a network fetch we do not do. ``identify`` surfaces it as a
+    provenance signal without setting ``is_ai_generated``.
+    """
+    return c2pa_cloud_manifest_in(scan_head(image_path, _QUICK_SCAN_BYTES))
+
+
 def huggingface_job(image_path: Path) -> str | None:
    """Return the HuggingFace job id if the image carries an ``hf-job-id`` PNG
    text chunk, else None.
@@ -772,6 +772,31 @@ class TestIntegrityClashesHelper:
        # must NOT raise a clash.
        assert _integrity_clashes({}, "Leica (camera, C2PA capture)", camera_has_ai_marker=False) == []

+    def test_pixel_generative_edit_same_manifest_no_clash(self):
+        # A Google Pixel that BOTH captures and runs on-device generative AI
+        # (Magic Editor / Pixel Studio) records the capture and the AI edit in
+        # ONE C2PA manifest -- the AI vendor is named only from that same
+        # manifest (c2pa / synthid), independent of nothing. That is a legitimate
+        # edit chain, NOT a camera-vs-AI contradiction, so rule 2 must stay quiet.
+        assert (
+            _integrity_clashes(
+                {"c2pa": "Google", "synthid": "Google"},
+                "Google Pixel (camera, C2PA capture)",
+                camera_has_ai_marker=True,
+            )
+            == []
+        )
+
+    def test_camera_plus_independent_ai_marker_still_clashes(self):
+        # But a camera capture next to an AI marker from a genuinely INDEPENDENT
+        # source (EXIF/XMP generator, TC260 AIGC, ...) is still a laundering tell.
+        clashes = _integrity_clashes(
+            {"c2pa": "Google", "aigc": "China AIGC (TC260)"},
+            "Google Pixel (camera, C2PA capture)",
+            camera_has_ai_marker=True,
+        )
+        assert any("Camera-capture" in c for c in clashes)
+

 class TestIntegrityClashEndToEnd:
    def _c2pa_jpeg(self, tmp_path: Path, blob: bytes) -> Path:
@@ -806,6 +831,22 @@ class TestIntegrityClashEndToEnd:
        assert r.platform == "Google Pixel (camera, C2PA capture)"
        assert any("Camera-capture C2PA credentials" in c and "AI-generation markers" in c for c in r.integrity_clashes)

+    def test_pixel_generative_edit_no_clash(self, tmp_path: Path):
+        # A real Google Pixel generative edit (Magic Editor / Pixel Studio) signs
+        # ONE manifest carrying both the Pixel Camera capture and a Google
+        # Generative AI edit (trainedAlgorithmicMedia + "Applied imperceptible
+        # SynthID watermark"). The AI marker lives in the SAME manifest as the
+        # device, so it is an edit chain, not a camera-vs-AI contradiction.
+        path = self._c2pa_jpeg(
+            tmp_path,
+            b"Pixel Camera ... Created by Pixel Camera ... computationalCapture ... "
+            b"Created by Google Generative AI ... trainedAlgorithmicMedia ... "
+            b"Applied imperceptible SynthID watermark",
+        )
+        r = identify(path, check_visible=False, check_invisible=False)
+        assert r.is_ai_generated is True
+        assert r.integrity_clashes == []
+
    def test_clash_serializes_to_json(self, tmp_path: Path):
        path = self._c2pa_jpeg(tmp_path, b"OpenAI ... trainedAlgorithmicMedia ... TC260:AIGC label")
        r = identify(path, check_visible=False, check_invisible=False)
@@ -790,6 +790,42 @@ class TestAIGCLabel:
    def test_has_ai_metadata_detects_raw_json_exif_form(self, tmp_path: Path):
        assert has_ai_metadata(self._aigc_exif_jpeg(tmp_path))

+    def _aigc_bare_jpeg(self, tmp_path: Path, producer: str = "00119144030008867405X210002") -> Path:
+        """Some China-served generators glue the TC260 label straight to its JSON
+        as a bare ``AIGC{...}`` blob inside a JPEG APP segment (no ``"AIGC":``
+        key wrapper, no PNG chunk, no namespaced XMP) -- seen near the JFIF
+        header on real 2026-06 downloads."""
+        p = tmp_path / "aigc_bare.jpg"
+        Image.new("RGB", (32, 32)).save(p)
+        raw = p.read_bytes()
+        blob = b'AIGC{"Label":"1","ContentProducer":"' + producer.encode() + b'","ProduceID":"8F995586"}'
+        segment = b"\xff\xe9" + (len(blob) + 2).to_bytes(2, "big") + blob  # APP9
+        p.write_bytes(raw[:2] + segment + raw[2:])  # splice after SOI
+        return p
+
+    def test_parses_bare_aigc_jpeg_segment_form(self, tmp_path: Path):
+        from remove_ai_watermarks.metadata import aigc_label
+
+        info = aigc_label(self._aigc_bare_jpeg(tmp_path))
+        assert info is not None
+        assert info["Label"] == "1"
+        assert info["ContentProducer"] == "00119144030008867405X210002"
+
+    def test_has_ai_metadata_detects_bare_aigc_jpeg_form(self, tmp_path: Path):
+        assert has_ai_metadata(self._aigc_bare_jpeg(tmp_path))
+
+    def test_bare_aigc_without_tc260_field_ignored(self, tmp_path: Path):
+        """A bare ``AIGC{...}`` blob with no TC260 field must not false-positive."""
+        from remove_ai_watermarks.metadata import aigc_label
+
+        p = tmp_path / "bare_unrelated.jpg"
+        Image.new("RGB", (32, 32)).save(p)
+        raw = p.read_bytes()
+        blob = b'AIGC{"unrelated":"value"}'
+        segment = b"\xff\xe9" + (len(blob) + 2).to_bytes(2, "big") + blob
+        p.write_bytes(raw[:2] + segment + raw[2:])
+        assert aigc_label(p) is None
+
    def test_raw_json_without_tc260_field_ignored(self, tmp_path: Path):
        """A bare ``{"AIGC":{...}}`` object with no TC260 field must not fire."""
        import json
@@ -1185,3 +1221,49 @@ class TestFfmpegMetadataStrip:
        remove_ai_metadata(src, out)
        assert out.exists()
        assert b"Suno AI generated" not in out.read_bytes()  # tag stripped, audio kept
+
+
+class TestC2paCloudManifest:
+    """C2PA 2.4 Durable Content Credentials: an XMP dcterms:provenance pointer to
+    a vendor cloud manifest store survives when the embedded manifest is stripped."""
+
+    def _cloud_png(self, tmp_path: Path, host: bytes = b"cai-manifests.adobe.com") -> Path:
+        xmp = (
+            b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?><x:xmpmeta xmlns:x="adobe:ns:meta/">'
+            b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
+            b'<rdf:Description rdf:about="" xmlns:dcterms="http://purl.org/dc/terms/" '
+            b'dcterms:provenance="https://' + host + b'/manifests/urn-c2pa-abc123"> </rdf:Description>'
+            b'</rdf:RDF></x:xmpmeta><?xpacket end="w"?>'
+        )
+        p = tmp_path / "cloud.png"
+        img = Image.new("RGB", (16, 16))
+        meta = PngInfo()
+        meta.add_itxt("XML:com.adobe.xmp", xmp.decode("latin-1"))
+        img.save(p, pnginfo=meta)
+        return p
+
+    def test_detects_adobe_cloud_manifest(self, tmp_path: Path):
+        from remove_ai_watermarks.metadata import c2pa_cloud_manifest
+
+        assert c2pa_cloud_manifest(self._cloud_png(tmp_path)) == "Adobe Content Authenticity"
+
+    def test_no_provenance_pointer_is_none(self, tmp_clean_png: Path):
+        from remove_ai_watermarks.metadata import c2pa_cloud_manifest
+
+        assert c2pa_cloud_manifest(tmp_clean_png) is None
+
+    def test_unknown_host_is_none(self, tmp_path: Path):
+        from remove_ai_watermarks.metadata import c2pa_cloud_manifest
+
+        # A dcterms:provenance pointer to an unrecognized host is not attributed.
+        assert c2pa_cloud_manifest(self._cloud_png(tmp_path, host=b"manifests.example.com")) is None
+
+    def test_cloud_manifest_does_not_assert_ai(self, tmp_path: Path):
+        # Provenance only -- a cloud manifest can describe a human edit, so the
+        # verdict must stay 'unknown', not 'AI-generated'.
+        from remove_ai_watermarks.identify import identify
+
+        r = identify(self._cloud_png(tmp_path), check_visible=False, check_invisible=False)
+        assert r.is_ai_generated is None
+        assert any("Durable Content Credentials" in w for w in r.watermarks)
+        assert any(s.name == "c2pa_cloud" for s in r.signals)