feat(identify): detect visible Doubao/Jimeng marks; keep identify import torch-free

identify previously ran only the Gemini sparkle as a visible detector, so a Doubao/Jimeng image with stripped TC260 metadata had no visible fallback. Add `_visible_text_marks` (registry-backed) so the ByteDance Doubao 豆包AI生成 and Jimeng 即梦AI marks are detected too, each gated by its own engine NCC threshold via MarkDetection.detected. New signals `visible_doubao` / `visible_jimeng` (medium), same stripped-metadata fallback role as the sparkle; excluded from integrity-clash vendor claims; set platform only when no harder signal did.

Also make `noai/__init__` lazy (PEP 562 __getattr__): importing the light `noai.c2pa` / `noai.constants` submodules (which identify needs) no longer eagerly pulls `watermark_remover`, which imports torch + diffusers at module top. `import remove_ai_watermarks.identify` drops from ~420 MB to ~21 MB in a full gpu/detect install (torch not loaded), so it fits a 512 MB host; the removal API resolves lazily on first access. Guarded by TestIdentifyImportIsLight.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-20 14:40:52 +02:00 · 2026-05-31 20:43:52 -07:00
parent 4b4049a6f1
commit e501bec9ff
5 changed files with 153 additions and 9 deletions
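The lazy `noai/__init__` change described in the commit message relies on PEP 562 module-level `__getattr__`. The following is a minimal, self-contained sketch of that mechanism, not the project's code: the package name `lazy_demo_pkg`, its `light` / `heavy` submodules, and `expensive_api` are all hypothetical stand-ins for `noai`, `noai.constants`, `noai.watermark_remover`, and the removal API. It writes a throwaway package to a temporary directory, imports the light submodule, and records whether the heavy one was loaded before and after the first attribute access.

```python
import importlib
import sys
import tempfile
from pathlib import Path

PKG = "lazy_demo_pkg"  # hypothetical package name, for illustration only

# Source of the demo package's __init__.py: no eager imports, only a hook.
INIT_SOURCE = '''\
__all__ = ["expensive_api"]


def __getattr__(name):
    # Called only when normal attribute lookup on the package fails (PEP 562).
    if name == "expensive_api":
        from lazy_demo_pkg import heavy  # deferred until first access

        return heavy.expensive_api
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
'''


def demo() -> tuple[bool, str, bool]:
    """Return (heavy loaded before access, API result, heavy loaded after)."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / PKG
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text(INIT_SOURCE)
        (pkg_dir / "light.py").write_text("NAME = 'light'\n")
        (pkg_dir / "heavy.py").write_text(
            "def expensive_api():\n    return 'resolved lazily'\n"
        )
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            # Importing a submodule runs the package __init__, but not heavy.
            importlib.import_module(f"{PKG}.light")
            before = f"{PKG}.heavy" in sys.modules
            # First attribute access goes through __getattr__ and imports heavy.
            result = sys.modules[PKG].expensive_api()
            after = f"{PKG}.heavy" in sys.modules
        finally:
            # Leave the interpreter as we found it.
            sys.path.remove(tmp)
            for mod in [m for m in sys.modules if m.split(".")[0] == PKG]:
                del sys.modules[mod]
    return before, result, after


if __name__ == "__main__":
    print(demo())
```

The design point is that the cost of the heavy import moves from package import time to first use, so callers that only need the light submodules never pay it.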
@@ -26,7 +26,7 @@ If this tool saves you time, consider [sponsoring its development](https://githu
 - **Smart Face Protection** — automatic extraction and blending of human faces to prevent AI distortion
 - **Batch processing** — process entire directories
 - **Detection** — three-stage NCC watermark detection with confidence scoring
- **Provenance detection (`identify`)** — aggregate C2PA issuer, the C2PA soft-binding forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, embedded SD/ComfyUI params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the China TC260 AIGC label (XMP, PNG chunk, or EXIF), the HuggingFace `hf-job-id` job marker, the SynthID metadata proxy, the visible sparkle, the open SD/SDXL/FLUX invisible watermark, and (with the `trustmark` extra) the open Adobe TrustMark watermark into one origin-platform + watermark-inventory verdict (`--json` for machine output)
+- **Provenance detection (`identify`)** — aggregate C2PA issuer, the C2PA soft-binding forensic-watermark vendor (Adobe TrustMark, Digimarc, Imatag, ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, embedded SD/ComfyUI params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the China TC260 AIGC label (XMP, PNG chunk, or EXIF), the HuggingFace `hf-job-id` job marker, the SynthID metadata proxy, the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" text marks), the open SD/SDXL/FLUX invisible watermark, and (with the `trustmark` extra) the open Adobe TrustMark watermark into one origin-platform + watermark-inventory verdict (`--json` for machine output)

 ## Examples

@@ -58,7 +58,7 @@ If this tool saves you time, consider [sponsoring its development](https://githu

 > Visible overlays are used by Google Gemini / Nano Banana (sparkle logo) and by ByteDance's Doubao ("豆包AI生成" corner text) and Jimeng / Dreamina ("★ 即梦AI" wordmark). All are removed on CPU by reverse-alpha against a captured alpha map (Jimeng adds a residual inpaint over the glyph footprint, since its mark re-rasterizes per image). Other services rely on invisible watermarks and/or metadata; our diffusion-based regeneration works against any invisible watermark in pixel or frequency domain. For a visible mark from any other source (any position, any colour), use the universal `erase --region` command.

-> **Detection:** `remove-ai-watermarks identify <image>` reports the origin platform and watermark inventory for all the signals above — C2PA issuer, the C2PA soft-binding forensic-watermark vendor (TrustMark / Digimarc / Imatag / ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, the China TC260 AIGC label (XMP, PNG chunk, or EXIF), the HuggingFace `hf-job-id` job marker, embedded generation params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the SynthID metadata proxy, the visible sparkle, and (with the `[detect]` / `[trustmark]` extras) the open SD/SDXL/FLUX and Adobe TrustMark invisible watermarks. SynthID and the proprietary soft-binding watermarks (Digimarc etc.) have no local decoder, so they are reported by metadata proxy / vendor name only.
+> **Detection:** `remove-ai-watermarks identify <image>` reports the origin platform and watermark inventory for all the signals above — C2PA issuer, the C2PA soft-binding forensic-watermark vendor (TrustMark / Digimarc / Imatag / ...), IPTC "Made with AI" plus the IPTC 2025.1 `AISystemUsed` field, the China TC260 AIGC label (XMP, PNG chunk, or EXIF), the HuggingFace `hf-job-id` job marker, embedded generation params, EXIF/XMP generator tags, the xAI/Grok EXIF signature, the SynthID metadata proxy, the visible marks (Gemini sparkle plus the Doubao "豆包AI生成" / Jimeng "即梦AI" text marks), and (with the `[detect]` / `[trustmark]` extras) the open SD/SDXL/FLUX and Adobe TrustMark invisible watermarks. SynthID and the proprietary soft-binding watermarks (Digimarc etc.) have no local decoder, so they are reported by metadata proxy / vendor name only.

 ## How it works

@@ -8,7 +8,8 @@ Aggregates every locally-readable signal into a single :class:`ProvenanceReport`
 - **PNG text / EXIF generation parameters** (Stable Diffusion, ComfyUI, InvokeAI).
 - **SynthID metadata proxy** -- a C2PA companion from a SynthID-using vendor
  (Google / OpenAI) implies the invisible pixel watermark.
- **Visible Gemini sparkle** (optional; needs cv2/numpy, no GPU).
+- **Visible marks** (optional; needs cv2/numpy, no GPU): the Gemini sparkle and
+  the ByteDance Doubao 豆包AI生成 / Jimeng 即梦AI text marks.

 Hard limit: a stripped image (re-encoded, screenshotted, social-media upload)
 loses all metadata, and the SynthID *pixel* watermark is not locally decodable
@@ -43,6 +44,8 @@ from remove_ai_watermarks.noai.constants import C2PA_AI_TOOLS, C2PA_ISSUERS
 if TYPE_CHECKING:
    from pathlib import Path

+    from remove_ai_watermarks.watermark_registry import MarkDetection
+
 log = logging.getLogger(__name__)

 # How much of a non-PNG container to binary-scan for the C2PA issuer.
@@ -334,6 +337,46 @@ def _visible_sparkle(image_path: Path) -> float | None:
    return detect_sparkle_confidence(image_path)


+# Visible text marks (registry keys) -> human-readable platform, mirroring the
+# Gemini-sparkle phrasing. These are the stripped-metadata visual fallback for
+# the China-served ByteDance generators (normally also caught by the TC260 AIGC
+# metadata label); the per-engine detection thresholds live in the registry.
+_VISIBLE_MARK_PLATFORM = {
+    "doubao": "ByteDance Doubao (visible 豆包AI生成 mark detected)",
+    "jimeng": "ByteDance Jimeng / Dreamina (visible 即梦AI mark detected)",
+}
+
+
+def _visible_text_marks(image_path: Path) -> list[MarkDetection]:
+    """Detected visible Doubao/Jimeng marks (registry ``MarkDetection`` list).
+
+    The Gemini sparkle keeps its own ``_visible_sparkle`` path (file-level
+    confidence); these two text marks reuse the registry detectors, which apply
+    each engine's calibrated NCC threshold via ``MarkDetection.detected``.
+    Optional: needs cv2/numpy; returns ``[]`` if the engines/assets are missing
+    or the image can't be read.
+    """
+    try:
+        from remove_ai_watermarks.image_io import imread
+        from remove_ai_watermarks.watermark_registry import get_mark
+    except Exception as exc:  # cv2/engine assets missing
+        log.debug("visible-mark detectors unavailable: %s", exc)
+        return []
+    image = imread(image_path)
+    if image is None:
+        return []
+    detections: list[MarkDetection] = []
+    for key in _VISIBLE_MARK_PLATFORM:
+        try:
+            det = get_mark(key).detect(image)
+        except Exception as exc:  # one engine failing must not break identify
+            log.debug("visible-mark %s detector failed: %s", key, exc)
+            continue
+        if det.detected:
+            detections.append(det)
+    return detections
+
+
 def _invisible_watermark(image_path: Path) -> str | None:
    """Open invisible-watermark scheme name (SD/SDXL/FLUX) or None.

@@ -361,7 +404,8 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b

    Args:
        image_path: Path to the image (PNG, JPEG, WebP, or ISOBMFF container).
-        check_visible: Also run the visible Gemini-sparkle detector (cv2). Set
+        check_visible: Also run the visible-mark detectors (cv2) -- the Gemini
+            sparkle and the Doubao/Jimeng text marks from the registry. Set
            False for a pure-metadata, dependency-light scan.
        check_invisible: Also decode open invisible watermarks (SD/SDXL/FLUX) via
            the optional imwatermark library. No-op when it is not installed.
@@ -580,7 +624,16 @@ def identify(image_path: Path, *, check_visible: bool = True, check_invisible: b
        if platform is None:
            platform = "Google Gemini family (visible sparkle detected)"

-    visible_only = any(s.name == "visible_sparkle" for s in signals) and not ai_from_metadata
+    # ── Visible Doubao / Jimeng text marks (registry; same stripped-metadata
+    #    fallback role as the Gemini sparkle above) ─
+    if check_visible:
+        for det in _visible_text_marks(image_path):
+            signals.append(Signal(f"visible_{det.key}", f"NCC confidence {det.confidence:.2f}", "medium"))
+            watermarks.append(f"Visible {det.label} (confidence {det.confidence:.2f})")
+            if platform is None:
+                platform = _VISIBLE_MARK_PLATFORM[det.key]
+
+    visible_only = any(s.name.startswith("visible_") for s in signals) and not ai_from_metadata
    hf_only = bool(hf_job) and not ai_from_metadata
    samsung_only = samsung_genai_type is not None and not ai_from_metadata

@@ -1,9 +1,36 @@
 """Vendored noai-watermark code for invisible watermark removal.

 Original: https://github.com/mertizci/noai-watermark (MIT License)
+
+The public API (``WatermarkRemover`` / ``remove_watermark`` / ``remove_ai_metadata``)
+is exposed **lazily** via PEP 562 ``__getattr__``: importing a light submodule
+(e.g. ``noai.c2pa`` / ``noai.constants`` from ``identify``) must NOT eagerly pull
+``watermark_remover``, which imports torch + diffusers at module top. Keeping this
+lazy is what lets ``import remove_ai_watermarks.identify`` stay cheap (~36 MB, no
+torch) even in a full install where the ``gpu``/``detect`` extras are present --
+otherwise the mere presence of torch in the env inflated identify to ~420 MB and
+risked OOM on a 512 MB host.
 """

-from remove_ai_watermarks.noai.cleaner import remove_ai_metadata
-from remove_ai_watermarks.noai.watermark_remover import WatermarkRemover, remove_watermark
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    from remove_ai_watermarks.noai.cleaner import remove_ai_metadata
+    from remove_ai_watermarks.noai.watermark_remover import WatermarkRemover, remove_watermark

 __all__ = ["WatermarkRemover", "remove_ai_metadata", "remove_watermark"]
+
+
+def __getattr__(name: str) -> object:
+    """Resolve the public API on first access (PEP 562), not at package import."""
+    if name == "remove_ai_metadata":
+        from remove_ai_watermarks.noai.cleaner import remove_ai_metadata
+
+        return remove_ai_metadata
+    if name in ("WatermarkRemover", "remove_watermark"):
+        from remove_ai_watermarks.noai import watermark_remover
+
+        return getattr(watermark_remover, name)
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
@@ -7,6 +7,8 @@ against the real committed C2PA / IPTC fixtures in data/samples/.
 from __future__ import annotations

 import json
+import subprocess
+import sys
 from dataclasses import asdict
 from pathlib import Path
 from unittest.mock import patch
@@ -371,6 +373,68 @@ class TestIdentifyVisibleSparkle:
        assert r.confidence == "high"


+class TestIdentifyImportIsLight:
+    """`import identify` must stay torch-free (lazy noai/__init__): the package
+    is deployed on a 512 MB host where eagerly pulling torch/diffusers OOMs."""
+
+    def test_import_identify_does_not_pull_torch(self):
+        # Only meaningful where torch is installed (the gpu/detect extra); on a
+        # core-only CI runner torch can't be in sys.modules anyway.
+        pytest.importorskip("torch")
+        code = "import sys, remove_ai_watermarks.identify; sys.exit(1 if 'torch' in sys.modules else 0)"
+        result = subprocess.run([sys.executable, "-c", code], capture_output=True, check=False)  # noqa: S603
+        assert result.returncode == 0, f"import identify pulled torch: {result.stderr.decode()[-500:]}"
+
+
+# Where the registry-backed Doubao/Jimeng visible detector resolves.
+_TEXT_MARKS_TARGET = "remove_ai_watermarks.identify._visible_text_marks"
+
+
+class TestIdentifyVisibleTextMarks:
+    """The visible Doubao/Jimeng marks are a stripped-metadata visual fallback,
+    parallel to the Gemini sparkle: each lifts an Unknown verdict to medium."""
+
+    @staticmethod
+    def _detection(key: str, label: str, conf: float):
+        from remove_ai_watermarks.watermark_registry import MarkDetection
+
+        return MarkDetection(key, label, "bottom-right", True, conf, (0, 0, 10, 10))
+
+    def test_doubao_promotes_to_medium(self, tmp_clean_png: Path):
+        det = self._detection("doubao", "Doubao 豆包AI生成 text", 0.8)
+        with patch(_SPARKLE_TARGET, return_value=None), patch(_TEXT_MARKS_TARGET, return_value=[det]):
+            r = identify(tmp_clean_png, check_visible=True)
+        assert r.is_ai_generated is True
+        assert r.confidence == "medium"
+        assert r.platform is not None
+        assert "Doubao" in r.platform
+        signal = next(s for s in r.signals if s.name == "visible_doubao")
+        assert signal.confidence == "medium"
+
+    def test_jimeng_promotes_to_medium(self, tmp_clean_png: Path):
+        det = self._detection("jimeng", "Jimeng 即梦AI wordmark", 0.9)
+        with patch(_SPARKLE_TARGET, return_value=None), patch(_TEXT_MARKS_TARGET, return_value=[det]):
+            r = identify(tmp_clean_png, check_visible=True)
+        assert r.is_ai_generated is True
+        assert r.confidence == "medium"
+        assert r.platform is not None
+        assert "Jimeng" in r.platform
+        assert any(s.name == "visible_jimeng" for s in r.signals)
+
+    def test_check_visible_false_skips_text_marks(self, tmp_clean_png: Path):
+        det = self._detection("doubao", "Doubao 豆包AI生成 text", 0.99)
+        with patch(_SPARKLE_TARGET, return_value=None), patch(_TEXT_MARKS_TARGET, return_value=[det]) as mock:
+            r = identify(tmp_clean_png, check_visible=False)
+        mock.assert_not_called()
+        assert not any(s.name == "visible_doubao" for s in r.signals)
+
+    def test_metadata_keeps_high_even_with_text_mark(self, tmp_png_with_ai_metadata: Path):
+        det = self._detection("doubao", "Doubao 豆包AI生成 text", 0.8)
+        with patch(_SPARKLE_TARGET, return_value=None), patch(_TEXT_MARKS_TARGET, return_value=[det]):
+            r = identify(tmp_png_with_ai_metadata, check_visible=True)
+        assert r.confidence == "high"
+
+
 # ── Caveats and serialization ───────────────────────────────────────
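
The `TestIdentifyImportIsLight` guard in the diff above checks `sys.modules` in a fresh interpreter, because the test process itself may already have torch loaded. A generalized sketch of that probe follows, under stated assumptions: the helper name `import_pulls` and the exit code `PULLED` are hypothetical and not part of the project. Unlike the test in the diff, it uses a dedicated exit code so that a failing import is reported as an error rather than being mistaken for "the forbidden module was pulled".

```python
import subprocess
import sys

PULLED = 3  # hypothetical exit code meaning "the forbidden module was imported"


def import_pulls(module: str, forbidden: str) -> bool:
    """True if importing `module` in a fresh interpreter also loads `forbidden`."""
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({module!r})\n"
        f"sys.exit({PULLED} if {forbidden!r} in sys.modules else 0)\n"
    )
    # A child interpreter starts with a clean sys.modules, unlike this process.
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, check=False
    )
    if result.returncode not in (0, PULLED):
        # Any other code means the import itself failed; surface stderr.
        raise RuntimeError(f"probe failed: {result.stderr.decode()[-500:]}")
    return result.returncode == PULLED
```

With this helper, the guard would read roughly as `assert not import_pulls("remove_ai_watermarks.identify", "torch")`, still skipped where torch is not installed.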