feat(auto): DBNet text detector, Real-ESRGAN upscaler, batch --auto

Three content-quality features for the invisible/all/batch pipeline.

DBNet text detector (auto_config): replace the MSER text heuristic with PP-OCRv3 differentiable-binarization via cv2.dnn.TextDetectionModel_DB, using a bundled 2.4 MB Apache-2.0 model (en/cn detection nets are byte-identical, so it ships language-neutral). cv2.dnn is core OpenCV, so no new pip dep. MSER stays as the fallback when the model can't load. Validated on real images: matches MSER everywhere and additionally catches the Doubao CJK mark MSER missed; routing decisions unchanged otherwise.

Real-ESRGAN upscaler (new upscaler.py, esrgan extra): optional pre-diffusion super-resolution for the min-resolution floor upscale, loaded via spandrel (MIT, no basicsr) with BSD-3-Clause weights downloaded on first use. New --upscaler {lanczos,esrgan} on invisible/all/batch; default stays lanczos and the engine falls back to lanczos when the extra is absent or the model errors (never breaks removal). It is a manual opt-in knob (the auto plan never selects it) -- as a generic GAN it sharpens photo/texture content strongly but can degrade faces (the diffusion pass regenerates them) and thin text, documented accordingly.

batch --auto: wire the content-adaptive --auto (+ --adaptive-polish) into cmd_batch. The plan is recomputed per image and the invisible engine is cached per resolved pipeline (default/controlnet), so a mixed directory builds at most one engine of each kind. Verified end-to-end: 3 mixed images routed correctly with only 2 pipeline loads (controlnet reused).

ruff + strict pyright(src/) clean; 558 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-25 17:00:57 +02:00 · 2026-06-04 16:04:33 -07:00
parent 4a6cd71ab2
commit 6d11c11b52
13 changed files with 507 additions and 27 deletions
@@ -113,7 +113,7 @@ image → encode to latent space (VAE) at native resolution
      → decode back to pixels (VAE)
 ```

- Large inputs run at native resolution (no down-then-up round-trip, which was the main quality loss in issue #10); use `--max-resolution N` only to cap GPU/MPS memory on very large inputs. Small inputs (long side under 1024 px) are auto-upscaled to a 1024 px floor before diffusion, because SDXL distorts on a tiny latent, and the result is restored to the original size (a transparent quality boost). Disable the floor with `--min-resolution 0`.
+- Large inputs run at native resolution (no down-then-up round-trip, which was the main quality loss in issue #10); use `--max-resolution N` only to cap GPU/MPS memory on very large inputs. Small inputs (long side under 1024 px) are auto-upscaled to a 1024 px floor before diffusion, because SDXL distorts on a tiny latent, and the result is restored to the original size (a transparent quality boost). Disable the floor with `--min-resolution 0`. The floor upscale uses Lanczos by default; `--upscaler esrgan` (the `esrgan` extra) runs Real-ESRGAN first for sharper detail and falls back to Lanczos if the extra is absent or the model errors. ESRGAN is a generic photo/texture GAN with no face/glyph prior, so it is best for photo/texture content -- it can degrade faces (the diffusion pass regenerates them, so the final output recovers) and thin text; keep Lanczos for text-heavy inputs.

 > **Default strength is vendor-adaptive (no flag needed).** The tool reads the C2PA issuer to detect which vendor's SynthID is present and picks the strength that clears it with the least quality loss: **OpenAI gpt-image → `0.10`**, **Google Gemini → `0.15`**, **unknown source → `0.15`**. An oracle-verified June 2026 study (clean pipeline, per-image openai.com/verify or Gemini app) found OpenAI's watermark clears at `0.05` across `1024`-`1600` px (resolution-independent) while Google's is ~3x more robust and needs `0.15`. The dominant factor is the vendor, not resolution. There is no local SynthID detector, so if the oracle still reads SynthID, raise `--strength`; if you care more about preserving fine text, lower it. (Caveat: Google's `0.15` was validated on the capped `--max-resolution 1536` path; a very large native Gemini image may need more.)
 >
@@ -213,6 +213,14 @@ After installation the `remove-ai-watermarks` command is available system-wide.
 > ```bash
 > pip install -e ".[restore]"   # or: uv pip install -e ".[restore]"
 > ```
+>
+> For sharper upscaling of small inputs before diffusion (`--upscaler esrgan`,
+> Real-ESRGAN), install the `esrgan` extra. It loads via spandrel (MIT, no basicsr);
+> the Real-ESRGAN weights (BSD-3-Clause) download on first use:
+>
+> ```bash
+> pip install -e ".[esrgan]"   # or: uv pip install -e ".[esrgan]"
+> ```

 #### Invisible watermark removal

@@ -280,7 +288,8 @@ remove-ai-watermarks erase image.png --region 1640,1930,400,100 -o clean.png
 remove-ai-watermarks invisible image.png -o clean.png --humanize 4.0 --unsharp 0.5
 # --humanize adds film grain, --unsharp counters the soft "AI" look (both opt-in).
 # Large images run at native resolution; small ones are upscaled to a 1024 floor
-# first (disable with --min-resolution 0). On a very large image that OOMs the
+# first (disable with --min-resolution 0); --upscaler esrgan uses Real-ESRGAN for
+# that floor upscale (needs the 'esrgan' extra). On a very large image that OOMs the
 # GPU/MPS, cap the long side: --max-resolution 2048
 # Strength is vendor-adaptive by default (OpenAI 0.10 / Google 0.15); override
 # with --strength. To preserve text/face structure, use --pipeline controlnet
@@ -301,6 +310,10 @@ remove-ai-watermarks metadata image.png --remove

 # Batch with a specific mode
 remove-ai-watermarks batch ./images/ --mode visible
+
+# Batch also accepts --auto (and --adaptive-polish): the plan is recomputed per
+# image, so a mixed directory routes each file to the right pipeline
+remove-ai-watermarks batch ./images/ --mode all --auto
 ```

 ### Python API
@@ -92,6 +92,19 @@ restore = [
    "scipy<1.18",
    "numba<0.60",
 ]
+# Optional pre-diffusion super-resolution for small inputs (Real-ESRGAN). Loaded via
+# spandrel (MIT) -- a pure model-loader with NO basicsr dependency (it pulls only
+# torch / torchvision / safetensors / numpy / einops), which sidesteps the
+# basicsr / torchvision.functional_tensor breakage that the `restore` extra fights.
+# The Real-ESRGAN weights (BSD-3-Clause) download on first use and are cached; they
+# are never bundled. CPU works but is slow on large inputs -- it is meant for the
+# pre-diffusion upscale of SMALL inputs (and the GPU worker). Guarded by
+# upscaler.is_available(); the default upscaler stays Lanczos (cv2, no deps). The
+# weights are fetched with torch.hub (bundled with spandrel's torch), so no extra
+# download dependency is needed.
+esrgan = [
+    "spandrel>=0.3.0",
+]
 dev = [
    "pytest>=8.0.0",
    "pytest-cov>=4.1.0",
@@ -17,14 +17,15 @@ text/graphics (already high-frequency, so almost no polish) and spares text/edge
 masking the grain.

 Detection is **cv2-only and torch-free**: OpenCV YuNet (``cv2.FaceDetectorYN``) for
-faces -- a 232 KB MIT-licensed model bundled in ``assets/`` -- plus a Canny
-edge-density + MSER region heuristic for text/structure. The whole planner peaks
-~100 MB RSS in a few ms, so it adds nothing meaningful to a GPU run and runs anywhere
-the pipeline runs.
+faces -- a 232 KB MIT-licensed model bundled in ``assets/`` -- DBNet (PP-OCRv3
+differentiable-binarization via ``cv2.dnn.TextDetectionModel_DB``, a 2.4 MB Apache-2.0
+model bundled in ``assets/``) for text, and a Canny ``edge_density``. The whole planner
+peaks ~100 MB RSS in a few ms, so it adds nothing meaningful to a GPU run and runs
+anywhere the pipeline runs.

-The text heuristic is a deliberately rough Phase-1 placeholder (DBNet via cv2.dnn is
-the planned precision upgrade); it only ever ADDS controlnet, so a miss is backstopped
-by the edge-density route and a false positive only costs a controlnet run.
+The text detector falls back to the old MSER region heuristic if the DBNet model can't
+load. Either way text only ever ADDS controlnet, so a miss is backstopped by the
+edge-density route and a false positive only costs a controlnet run.
 """

 # cv2/numpy boundary: cv2 ships no usable element types; relax the unknown-type rules
@@ -47,15 +48,29 @@ logger = logging.getLogger(__name__)
 # preserve). The headshot measures ~0.022, a busy photo higher; only a near-flat
 # gradient/solid image falls under 0.008.
 _STRUCTURELESS_EDGE_MAX = 0.008
-# MSER regions per megapixel above this -> likely text. Rough Phase-1 heuristic: a
-# no-text portrait measures a few hundred/MP, dense text far more. Set high so it
-# rarely false-fires; it only ever ADDS controlnet so miscalibration is low-harm.
+# MSER regions per megapixel above this -> likely text. The MSER path is now only the
+# FALLBACK when the bundled DBNet model can't load; DBNet (below) is the primary text
+# detector. Rough heuristic: a no-text portrait measures a few hundred/MP, dense text
+# far more. Set high so it rarely false-fires; text only ever ADDS controlnet.
 _TEXT_MSER_PER_MP = 1500.0
 _FACE_SCORE = 0.6  # YuNet confidence for a face to count
 # Downscale the long side to this for DETECTION only (faces stay detectable down to
-# ~10px, and this bounds YuNet/MSER cost on huge inputs). Removal runs at full res.
+# ~10px, and this bounds YuNet/DBNet/MSER cost on huge inputs). Removal runs at full res.
 _DETECT_MAX_SIDE = 1024

+# DBNet (PP-OCRv3 differentiable-binarization) text-region detector via cv2.dnn -- the
+# primary "has meaningful text" signal. The model is the shared PP-OCRv3 detection net
+# from OpenCV Zoo (Apache-2.0); en/cn variants are byte-identical, so it is bundled
+# language-neutral. cv2.dnn is core OpenCV, so this adds NO new pip dependency.
+_DBNET_ASSET = "text_detection_ppocrv3_2023may.onnx"  # Apache-2.0 (OpenCV Zoo PP-OCRv3 DB)
+_DBNET_BINARY_THRESHOLD = 0.3
+_DBNET_POLYGON_THRESHOLD = 0.5
+_DBNET_MAX_CANDIDATES = 200
+_DBNET_UNCLIP_RATIO = 2.0
+_DBNET_INPUT_SIDE = 736  # square input, multiple of 32 (PP-OCRv3 default)
+_DBNET_MEAN = (122.67891434, 116.66876762, 104.00698793)  # per-channel mean (0-255 scale) used by the DB text-detection preprocessing
+_dbnet: Any = None  # lazy singleton; set to False after a load failure (-> MSER fallback)
+
 # When a smoothing pass ran (controlnet or face restore), the adaptive polish
 # (humanizer.adaptive_polish) restores the input's detail level, sparing text --
 # replacing the old fixed unsharp/grain which over-/under-corrected and speckled text.
@@ -152,8 +167,41 @@ def detect_face(image: NDArray[Any]) -> bool:
    return faces is not None and len(faces) > 0


-def detect_text(image: NDArray[Any]) -> bool:
-    """Rough MSER-based text-presence heuristic (Phase-1 placeholder for DBNet)."""
+def _detect_text_dbnet(image: NDArray[Any]) -> bool | None:
+    """DBNet (PP-OCRv3) text-region presence via cv2.dnn.
+
+    Returns True/False on a successful run, or None if the bundled model can't load or
+    inference raises (the caller then falls back to the MSER heuristic). Loads once,
+    lazily.
+    """
+    import cv2
+
+    global _dbnet
+    if _dbnet is False:  # a prior load failed; skip straight to the MSER fallback
+        return None
+    img = _to_bgr(image)
+    h, w = img.shape[:2]
+    if h < 1 or w < 1:
+        return False
+    try:
+        if _dbnet is None:
+            model = Path(__file__).parent / "assets" / _DBNET_ASSET
+            net = cv2.dnn.TextDetectionModel_DB(str(model))
+            net.setBinaryThreshold(_DBNET_BINARY_THRESHOLD)
+            net.setPolygonThreshold(_DBNET_POLYGON_THRESHOLD)
+            net.setMaxCandidates(_DBNET_MAX_CANDIDATES)
+            net.setUnclipRatio(_DBNET_UNCLIP_RATIO)
+            net.setInputParams(1.0 / 255.0, (_DBNET_INPUT_SIDE, _DBNET_INPUT_SIDE), _DBNET_MEAN)
+            _dbnet = net
+        boxes, _ = _dbnet.detect(img)
+    except Exception as e:  # model load / inference can raise cv2.error or others
+        logger.debug("DBNet text detect failed (%s); falling back to MSER", e)
+        _dbnet = False
+        return None
+    return boxes is not None and len(boxes) > 0
+
+
+def _detect_text_mser(image: NDArray[Any]) -> bool:
+    """Fallback MSER-based text-presence heuristic (used only if DBNet can't load)."""
    import cv2

    gray = _to_gray(image)
@@ -166,6 +214,12 @@ def detect_text(image: NDArray[Any]) -> bool:
    return per_mp > _TEXT_MSER_PER_MP


+def detect_text(image: NDArray[Any]) -> bool:
+    """Text-presence: DBNet (cv2.dnn) when the bundled model loads, else the MSER heuristic."""
+    dbnet = _detect_text_dbnet(image)
+    return _detect_text_mser(image) if dbnet is None else dbnet
+
+
 def edge_density(image: NDArray[Any]) -> float:
    """Fraction of Canny edge pixels -- a cheap 'has structure' proxy in [0, 1]."""
    import cv2
@@ -190,9 +244,9 @@ def plan(image_path: Path) -> AutoConfig | None:

    h, w = image.shape[:2]
    small = _downscale_for_detection(image)
-    gray = _to_gray(small)  # convert once; the text/edge detectors pass a gray input through
+    gray = _to_gray(small)  # convert once; edge density + the MSER fallback use gray
    has_face = detect_face(small)  # YuNet needs the 3-channel image
-    has_text = detect_text(gray)
+    has_text = detect_text(small)  # DBNet wants BGR; the MSER fallback grays it internally
    edges = edge_density(gray)

    structureless = (not has_face) and (not has_text) and edges < _STRUCTURELESS_EDGE_MAX
@@ -159,6 +159,16 @@ _unsharp_option = click.option(
    "--unsharp", type=float, default=0.0, help="Unsharp-mask sharpening strength (0 = off, typical: 0.3-0.8)."
 )

+_upscaler_option = click.option(
+    "--upscaler",
+    type=click.Choice(["lanczos", "esrgan"]),
+    default="lanczos",
+    help="How to upscale a small input to the --min-resolution floor: lanczos (default, cv2, no deps) or "
+    "esrgan (Real-ESRGAN via the 'esrgan' extra; better detail, slower on CPU). Best for photo/texture "
+    "content -- as a generic GAN with no face/glyph prior it can degrade faces (diffusion mitigates) and "
+    "thin text, so lanczos stays the default. Falls back to lanczos if the extra is absent. Only applies when upscaling.",
+)
+
 _auto_option = click.option(
    "--auto",
    is_flag=True,
@@ -210,6 +220,21 @@ def _apply_auto(
    return pipeline, restore_faces, adaptive_polish


+def _warn_if_esrgan_unavailable(upscaler: str) -> None:
+    """Tell the user once if ``--upscaler esrgan`` will silently fall back to Lanczos.
+
+    The engine downgrades to Lanczos when the ``esrgan`` extra is absent (fail-safe, so
+    a batch never breaks mid-run) -- but without this notice the user would believe
+    Real-ESRGAN ran. Surfaced at the CLI layer, once per invocation (not per image).
+    """
+    if upscaler != "esrgan":
+        return
+    from remove_ai_watermarks import upscaler as _upscaler
+
+    if not _upscaler.is_available():
+        console.print("  Note: --upscaler esrgan needs the 'esrgan' extra; falling back to Lanczos.")
+
+
 def _restore_faces_options(f: Any) -> Any:
    """Attach the shared GFPGAN face-restoration flags to an invisible-pipeline command."""
    restore_flag = click.option(
@@ -557,6 +582,7 @@ def cmd_erase(
@_restore_faces_options
@_min_resolution_option
@_unsharp_option
+@_upscaler_option
@_auto_option
@_adaptive_polish_option
@click.pass_context
@@ -577,6 +603,7 @@ def cmd_invisible(
    controlnet_scale: float,
    restore_faces: bool,
    restore_faces_weight: float,
+    upscaler: str,
    auto: bool,
    adaptive_polish: bool,
 ) -> None:
@@ -596,6 +623,7 @@ def cmd_invisible(
    from remove_ai_watermarks.invisible_engine import InvisibleEngine

    source = _validate_image(source)
+    _warn_if_esrgan_unavailable(upscaler)
    if auto:
        pipeline, restore_faces, adaptive_polish = _apply_auto(ctx, source, pipeline, restore_faces, adaptive_polish)
    if output is None:
@@ -634,6 +662,7 @@ def cmd_invisible(
        adaptive_polish=adaptive_polish,
        max_resolution=max_resolution,
        min_resolution=min_resolution,
+        upscaler=upscaler,
        vendor=vendor,
        restore_faces=restore_faces,
        restore_faces_weight=restore_faces_weight,
@@ -815,6 +844,7 @@ def cmd_identify(ctx: click.Context, source: Path, no_visible: bool, as_json: bo
@_restore_faces_options
@_min_resolution_option
@_unsharp_option
+@_upscaler_option
@_auto_option
@_adaptive_polish_option
@click.pass_context
@@ -838,6 +868,7 @@ def cmd_all(
    controlnet_scale: float,
    restore_faces: bool,
    restore_faces_weight: float,
+    upscaler: str,
    auto: bool,
    adaptive_polish: bool,
 ) -> None:
@@ -854,6 +885,7 @@ def cmd_all(

    _banner()
    source = _validate_image(source)
+    _warn_if_esrgan_unavailable(upscaler)
    if auto:
        pipeline, restore_faces, adaptive_polish = _apply_auto(ctx, source, pipeline, restore_faces, adaptive_polish)

@@ -941,6 +973,7 @@ def cmd_all(
                adaptive_polish=adaptive_polish,
                max_resolution=max_resolution,
                min_resolution=min_resolution,
+                upscaler=upscaler,
                vendor=vendor,
                restore_faces=restore_faces,
                restore_faces_weight=restore_faces_weight,
@@ -1001,6 +1034,9 @@ def _process_batch_image(
    restore_faces: bool = False,
    restore_faces_weight: float = 0.5,
    controlnet_scale: float = 1.0,
+    upscaler: str = "lanczos",
+    auto: bool = False,
+    adaptive_polish: bool = False,
 ) -> None:
    """Process a single image for batch mode.

@@ -1046,14 +1082,22 @@ def _process_batch_image(
        if invisible_available():
            from remove_ai_watermarks.invisible_engine import InvisibleEngine

-            if "_inv_engine" not in ctx.obj:
-                ctx.obj["_inv_engine"] = InvisibleEngine(
+            # --auto re-plans the pipeline / face-restore / polish per image; only the
+            # pipeline choice changes the engine ctor, so cache one engine per pipeline
+            # (controlnet vs default) rather than a single shared instance.
+            if auto:
+                pipeline, restore_faces, adaptive_polish = _apply_auto(
+                    ctx, img_path, pipeline, restore_faces, adaptive_polish
+                )
+            engines = ctx.obj.setdefault("_inv_engines", {})
+            if pipeline not in engines:
+                engines[pipeline] = InvisibleEngine(
                    device=None if device == "auto" else device,
                    pipeline=pipeline,
                    hf_token=hf_token,
                    controlnet_conditioning_scale=controlnet_scale,
                )
-            engine_inv = ctx.obj["_inv_engine"]
+            engine_inv = engines[pipeline]
            engine_inv.remove_watermark(
                img_path if mode == "invisible" else out_path,
                out_path,
@@ -1062,8 +1106,10 @@ def _process_batch_image(
                seed=seed,
                humanize=humanize,
                unsharp=unsharp,
+                adaptive_polish=adaptive_polish,
                max_resolution=max_resolution,
                min_resolution=min_resolution,
+                upscaler=upscaler,
                restore_faces=restore_faces,
                restore_faces_weight=restore_faces_weight,
                # Detect the vendor from the pristine original (`img_path`), not the
@@ -1126,7 +1172,10 @@ def _process_batch_image(
@_restore_faces_options
@_min_resolution_option
@_unsharp_option
+@_upscaler_option
@_controlnet_scale_option
+@_auto_option
+@_adaptive_polish_option
@click.pass_context
 def cmd_batch(
    ctx: click.Context,
@@ -1147,6 +1196,9 @@ def cmd_batch(
    restore_faces: bool,
    restore_faces_weight: float,
    controlnet_scale: float,
+    upscaler: str,
+    auto: bool,
+    adaptive_polish: bool,
 ) -> None:
    """Process all images in a directory."""
    _banner()
@@ -1164,6 +1216,8 @@ def cmd_batch(
    console.print(f"  Found {len(images)} images in {directory}")
    console.print(f"  Output -> {output_dir}")
    console.print(f"  Mode: {mode}")
+    if mode in ("invisible", "all"):
+        _warn_if_esrgan_unavailable(upscaler)

    processed = 0
    errors = 0
@@ -1202,6 +1256,9 @@ def cmd_batch(
                    restore_faces=restore_faces,
                    restore_faces_weight=restore_faces_weight,
                    controlnet_scale=controlnet_scale,
+                    upscaler=upscaler,
+                    auto=auto,
+                    adaptive_polish=adaptive_polish,
                )
                processed += 1
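
The per-pipeline engine cache wired into `_process_batch_image` can be illustrated in isolation. The sketch below is an assumption-laden toy: `FakeEngine`, `run_batch`, and `toy_plan` are hypothetical names, and the routing rule is invented for the example. It only shows why keying the cache on the resolved pipeline means a mixed directory constructs at most one engine per pipeline.

```python
from typing import Callable


class FakeEngine:
    """Hypothetical stand-in for an engine that is expensive to construct."""

    builds = 0  # counts constructions across the whole run

    def __init__(self, pipeline: str) -> None:
        FakeEngine.builds += 1
        self.pipeline = pipeline

    def process(self, name: str) -> str:
        return f"{name} via {self.pipeline}"


def run_batch(images: list[str], plan: Callable[[str], str]) -> list[str]:
    """Re-plan per image, but build each pipeline's engine only once."""
    engines: dict[str, FakeEngine] = {}
    results = []
    for name in images:
        pipeline = plan(name)  # recomputed for every image
        if pipeline not in engines:
            engines[pipeline] = FakeEngine(pipeline)
        results.append(engines[pipeline].process(name))
    return results


def toy_plan(name: str) -> str:
    """Hypothetical routing rule: structured content goes to controlnet."""
    return "controlnet" if ("portrait" in name or "poster" in name) else "default"
```

With three images routed to two pipelines, only two engines are built and the controlnet engine is reused, which matches the shape of the end-to-end check described in the commit message.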

@@ -126,6 +126,32 @@ class InvisibleEngine:
        """Eagerly load the pipeline so download progress is visible."""
        self._remover.preload()

+    def _esrgan_upscale(self, image: Any, target: tuple[int, int]) -> Any:
+        """Upscale a PIL image to ``target`` with Real-ESRGAN, else Lanczos.
+
+        Runs Real-ESRGAN at its native factor (on the remover's device, CPU fallback),
+        then resizes to the exact ``target`` with Lanczos. Falls back to a plain Lanczos
+        resize when the ``esrgan`` extra is absent or the model errors.
+        """
+        import cv2
+        import numpy as np
+        from PIL import Image
+
+        from remove_ai_watermarks import upscaler
+
+        if not upscaler.is_available():
+            logger.debug("esrgan upscaler requested but the extra is absent; using Lanczos")
+            return image.resize(target, Image.Resampling.LANCZOS)
+        try:
+            bgr = cv2.cvtColor(np.array(image.convert("RGB")), cv2.COLOR_RGB2BGR)
+            big = upscaler.upscale(bgr, device=self._remover.device)
+            if (big.shape[1], big.shape[0]) != target:
+                big = cv2.resize(big, target, interpolation=cv2.INTER_LANCZOS4)
+            return Image.fromarray(cv2.cvtColor(big, cv2.COLOR_BGR2RGB))
+        except Exception as e:  # never let an optional upscaler break removal
+            logger.warning("Real-ESRGAN upscale failed (%s); using Lanczos", e)
+            return image.resize(target, Image.Resampling.LANCZOS)
+
    def remove_watermark(
        self,
        image_path: Path,
@@ -142,6 +168,7 @@ class InvisibleEngine:
        restore_faces_weight: float = 0.5,
        unsharp: float = 0.0,
        adaptive_polish: bool = False,
+        upscaler: str = "lanczos",
    ) -> Path:
        """Remove invisible watermark from an image.

@@ -180,6 +207,11 @@ class InvisibleEngine:
                (default) = on; 0 = off. The output is restored to the original
                input size, so this is a transparent quality boost; it adds time
                and memory on small inputs. Ignored on a min > max misconfig.
+            upscaler: How to upscale a small input to the ``min_resolution`` floor:
+                ``"lanczos"`` (default, cv2, no deps) or ``"esrgan"`` (Real-ESRGAN
+                via the ``esrgan`` extra). Only applies when UPscaling (the floor
+                case); a ``max_resolution`` downscale always uses Lanczos. Falls back
+                to Lanczos if the extra is absent.

        Returns:
            Path to the cleaned image.
@@ -202,8 +234,8 @@ class InvisibleEngine:

        target = _target_size(image.width, image.height, max_resolution, min_resolution)
        if target is not None:
+            upscaling = max(target) > max(image.width, image.height)
            if self._progress_callback:
-                upscaling = max(target) > max(image.width, image.height)
                reason = (
                    f"min-resolution floor {min_resolution}px"
                    if upscaling
@@ -211,7 +243,12 @@ class InvisibleEngine:
                )
                verb = "Upscaling" if upscaling else "Downscaling"
                self._progress_callback(f"{verb} {image.width}x{image.height} to {target[0]}x{target[1]} ({reason})...")
-            image = image.resize(target, Image.Resampling.LANCZOS)
+            # Real-ESRGAN only helps when UPscaling (the floor case); a downscale cap
+            # always uses Lanczos. _esrgan_upscale falls back to Lanczos if the extra is absent.
+            if upscaling and upscaler == "esrgan":
+                image = self._esrgan_upscale(image, target)
+            else:
+                image = image.resize(target, Image.Resampling.LANCZOS)

        # Always persist to a temp file, even without downscaling: WatermarkRemover
        # reloads by path, so the EXIF-transposed pixels must be saved or rotation
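
The resampler choice in this hunk can be summarised as a small decision function. This is a sketch under stated assumptions, not the engine's code: `resize_plan` is a hypothetical helper, sizes are `(width, height)` tuples, the target is taken as given, and the native factor of 2 reflects the x2plus checkpoint named in `upscaler.py`. It omits the runtime fallback to Lanczos when the model itself raises.

```python
def resize_plan(size, target, upscaler="lanczos", esrgan_available=True, native_factor=2):
    """Return the ordered resampling steps for one resize (hypothetical helper).

    Real-ESRGAN is used only when the long side grows, the user asked for it,
    and the optional backend is importable; everything else is plain Lanczos.
    """
    upscaling = max(target) > max(size)
    if not (upscaling and upscaler == "esrgan" and esrgan_available):
        return ["lanczos"]
    # The model upscales by a fixed factor; a Lanczos pass follows only if
    # that result does not already match the requested target exactly.
    after_sr = (size[0] * native_factor, size[1] * native_factor)
    return ["esrgan"] if after_sr == target else ["esrgan", "lanczos"]
```

For example, a 512x384 input with a 1024x768 target needs only the super-resolution step, while a 400x300 input with the same target needs a Lanczos finish after the 2x pass, and any downscale stays on Lanczos regardless of the flag.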
@@ -0,0 +1,125 @@
+"""Optional pre-diffusion super-resolution for small inputs (Real-ESRGAN via spandrel).
+
+Mirrors ``region_eraser``'s optional-backend pattern: ``is_available()`` guards the
+``spandrel`` import, a lazy singleton (double-checked lock) holds the loaded model, and
+the weights download on first use (cached by ``torch.hub``) -- they are never bundled.
+
+The DEFAULT upscaler stays Lanczos (cv2, no deps); this is opt-in via the ``esrgan``
+extra and feeds the ``--upscaler esrgan`` path. ``spandrel`` is a pure model-loader
+(MIT) with NO basicsr dependency -- it pulls only torch/torchvision/safetensors/numpy/
+einops -- so it sidesteps the basicsr / ``torchvision.transforms.functional_tensor``
+breakage that the ``restore`` (GFPGAN) extra has to shim. Real-ESRGAN weights are
+BSD-3-Clause.
+
+CPU works but is slow on large inputs, so this is meant for the pre-diffusion upscale of
+SMALL inputs (and the GPU worker). On a memory-constrained host where the extra is not
+installed, ``is_available()`` returns False and the caller falls back to Lanczos.
+"""
+
+# torch/spandrel boundary: these libs ship no usable element types; relax the
+# unknown-type rules for this file only.
+# pyright: reportUnknownMemberType=false, reportUnknownArgumentType=false, reportUnknownVariableType=false, reportUnknownParameterType=false, reportMissingTypeArgument=false, reportMissingTypeStubs=false, reportMissingImports=false, reportArgumentType=false, reportAssignmentType=false, reportReturnType=false, reportCallIssue=false, reportIndexIssue=false, reportOperatorIssue=false, reportAttributeAccessIssue=false, reportPrivateImportUsage=false
+from __future__ import annotations
+
+import importlib.util
+import logging
+import threading
+from pathlib import Path
+from typing import TYPE_CHECKING, Any
+
+if TYPE_CHECKING:
+    from numpy.typing import NDArray
+
+logger = logging.getLogger(__name__)
+
+# Real-ESRGAN x2plus (BSD-3-Clause), official release. x2 is the right native factor for
+# the pre-diffusion floor upscale (small inputs ~512 -> ~1024); spandrel infers the
+# architecture and scale from the checkpoint, so swapping the URL is enough to change it.
+_MODEL_URL = "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth"
+_MODEL_FILENAME = "RealESRGAN_x2plus.pth"
+
+_model: Any = None  # lazy singleton (spandrel ImageModelDescriptor)
+_model_device: str = "cpu"
+_lock = threading.Lock()
+
+
+def is_available() -> bool:
+    """True if the ``esrgan`` extra (spandrel + torch) is importable."""
+    return importlib.util.find_spec("spandrel") is not None and importlib.util.find_spec("torch") is not None
+
+
+def _model_cache_path() -> Path:
+    """Path the weights are cached at (the torch.hub checkpoints dir)."""
+    import torch
+
+    cache_dir = Path(torch.hub.get_dir()) / "checkpoints"
+    cache_dir.mkdir(parents=True, exist_ok=True)
+    return cache_dir / _MODEL_FILENAME
+
+
+def _get_model(device: str) -> Any:
+    """Load the Real-ESRGAN model once (downloading the weights on first use)."""
+    global _model, _model_device
+    if _model is not None and _model_device == device:
+        return _model
+    with _lock:
+        if _model is None:
+            import torch
+            from spandrel import ImageModelDescriptor, ModelLoader
+
+            dst = _model_cache_path()
+            if not dst.exists():
+                logger.info("Downloading Real-ESRGAN weights to %s", dst)
+                torch.hub.download_url_to_file(_MODEL_URL, str(dst), progress=False)
+            model = ModelLoader().load_from_file(str(dst))
+            if not isinstance(model, ImageModelDescriptor):
+                raise RuntimeError(f"Unexpected spandrel model type: {type(model).__name__}")
+            _model = model.eval()
+        if _model_device != device:
+            _model.to(device)
+            _model_device = device
+    return _model
+
+
+def scale() -> int:
+    """The model's native upscale factor (e.g. 2 for x2plus). Loads the model if needed."""
+    return int(_get_model("cpu").scale)
+
+
+def upscale(image: NDArray[Any], device: str | None = None) -> NDArray[Any]:
+    """Upscale a BGR uint8 image by the model's native factor with Real-ESRGAN.
+
+    Returns a BGR uint8 array. Falls back to CPU if the requested device errors (an
+    MPS/CUDA OOM or unsupported-op on the small pre-diffusion input), mirroring the
+    diffusion engine's MPS->CPU fallback.
+
+    Raises:
+        RuntimeError: if the ``esrgan`` extra is not installed (guard with
+            ``is_available()`` first).
+    """
+    if not is_available():
+        raise RuntimeError("Real-ESRGAN upscaler needs the 'esrgan' extra (spandrel). Install it or use Lanczos.")
+    import cv2
+    import numpy as np
+    import torch
+
+    target_device = (device or "cpu").lower()
+    if target_device not in {"cpu", "mps", "cuda", "xpu"}:
+        target_device = "cpu"
+    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
+    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float().div(255.0).unsqueeze(0)
+
+    def _run(dev: str) -> NDArray[Any]:
+        model = _get_model(dev)
+        with torch.no_grad():
+            out = model(tensor.to(dev))
+        arr = out.clamp(0.0, 1.0).squeeze(0).permute(1, 2, 0).cpu().numpy() * 255.0
+        return cv2.cvtColor(arr.round().astype(np.uint8), cv2.COLOR_RGB2BGR)
+
+    try:
+        return _run(target_device)
+    except Exception as e:  # GPU OOM / unsupported op: fall back to CPU
+        if target_device == "cpu":
+            raise
+        logger.warning("Real-ESRGAN on %s failed (%s); retrying on CPU", target_device, e)
+        return _run("cpu")
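
Two patterns from this module are easy to try in isolation: guarding an optional dependency with `importlib.util.find_spec`, and retrying once on CPU when the requested device fails. The sketch below uses only the standard library. `backend_available`, `run_with_cpu_fallback`, and `flaky` are hypothetical names, and devices are plain strings rather than torch devices.

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


def backend_available(*modules: str) -> bool:
    """True if every named top-level module can be found, without importing it."""
    return all(importlib.util.find_spec(m) is not None for m in modules)


def run_with_cpu_fallback(run, device: str):
    """Call run(device); on any error retry once on CPU, re-raising if already on CPU."""
    try:
        return run(device)
    except Exception as e:
        if device == "cpu":
            raise
        logger.warning("run on %s failed (%s); retrying on CPU", device, e)
        return run("cpu")


def flaky(device: str) -> str:
    """Hypothetical workload that only succeeds on CPU."""
    if device != "cpu":
        raise RuntimeError("out of memory")
    return "ok on cpu"
```

Checking with `find_spec` keeps the availability test cheap, since the heavy import only happens once the backend is actually used.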
@@ -34,6 +34,26 @@ class TestDetectors:
        cv2.putText(text, "HELLO AI TEXT", (10, 120), cv2.FONT_HERSHEY_SIMPLEX, 2.0, (0, 0, 0), 3)
        assert auto_config.edge_density(text) > auto_config.edge_density(blank)

+    def test_dbnet_detects_text_card(self):
+        """The bundled PP-OCRv3 DBNet model fires on a clear text card and not on flat."""
+        card = np.full((300, 500, 3), 255, dtype=np.uint8)
+        cv2.putText(card, "INVOICE TOTAL 1234", (10, 170), cv2.FONT_HERSHEY_SIMPLEX, 2.0, (0, 0, 0), 4)
+        assert auto_config._detect_text_dbnet(card) is True
+        assert auto_config._detect_text_dbnet(np.full((300, 500, 3), 128, dtype=np.uint8)) is False
+
+    def test_detect_text_falls_back_to_mser_when_dbnet_unavailable(self, monkeypatch):
+        """If DBNet can't load (returns None), detect_text uses the MSER heuristic."""
+        monkeypatch.setattr(auto_config, "_detect_text_dbnet", lambda _img: None)
+        called = {}
+
+        def _fake_mser(_img):
+            called["mser"] = True
+            return True
+
+        monkeypatch.setattr(auto_config, "_detect_text_mser", _fake_mser)
+        assert auto_config.detect_text(np.full((100, 100, 3), 128, dtype=np.uint8)) is True
+        assert called.get("mser") is True
+

 class TestPlan:
    def test_unreadable_returns_none(self, tmp_path):
@@ -514,6 +514,45 @@ class TestBatchCommand:
        assert out[0, 0, 3] == 0
        assert out[100, 100, 3] == 255

+    def test_batch_auto_plans_pipeline_per_image(self, runner, tmp_path):
+        """--auto in batch re-plans the pipeline/restore/polish per image and
+        builds one engine per resolved pipeline."""
+        from remove_ai_watermarks import auto_config
+
+        input_dir = _make_batch_dir(tmp_path, count=2)
+        output_dir = tmp_path / "output"
+        plan = auto_config.AutoConfig(
+            pipeline="controlnet",
+            restore_faces=True,
+            adaptive_polish=True,
+            unsharp=0.0,
+            humanize=0.0,
+            min_resolution=1024,
+            has_face=True,
+            has_text=False,
+            edge_density=0.05,
+            width=200,
+            height=200,
+        )
+        mock_cls, mock_engine = _mock_invisible_engine()
+        with (
+            patch("remove_ai_watermarks.cli.InvisibleEngine", mock_cls, create=True),
+            patch("remove_ai_watermarks.invisible_engine.InvisibleEngine", mock_cls),
+            patch("remove_ai_watermarks.cli.invisible_available", return_value=True, create=True),
+            patch("remove_ai_watermarks.invisible_engine.is_available", return_value=True),
+            patch("remove_ai_watermarks.auto_config.plan", return_value=plan),
+        ):
+            result = runner.invoke(
+                main,
+                ["batch", str(input_dir), "-o", str(output_dir), "--mode", "invisible", "--auto"],
+            )
+        assert result.exit_code == 0, result.output
+        assert "2 processed" in result.output
+        # Engine built with the auto-resolved controlnet pipeline.
+        assert mock_cls.call_args.kwargs["pipeline"] == "controlnet"
+        # The auto plan's adaptive polish reached the engine call.
+        assert mock_engine.remove_watermark.call_args.kwargs["adaptive_polish"] is True
+
    def test_batch_default_output_dir(self, runner, tmp_path):
        input_dir = _make_batch_dir(tmp_path)
        result = runner.invoke(
@@ -101,3 +101,70 @@ class TestTargetSize:
        # min(1024) > max(800) is a misconfig: the floor must not upscale above the
        # cap, so it is skipped and the (within-cap) input stays native.
        assert _target_size(500, 400, 800, 1024) is None
+
+
+class TestEsrganUpscale:
+    """Branches of InvisibleEngine._esrgan_upscale (no diffusion model loaded).
+
+    A SimpleNamespace stands in for the engine so we exercise the helper without
+    constructing a real InvisibleEngine (which would load WatermarkRemover).
+    """
+
+    @staticmethod
+    def _fake_engine():
+        from types import SimpleNamespace
+
+        return SimpleNamespace(_remover=SimpleNamespace(device="cpu"))
+
+    @staticmethod
+    def _pil(w=120, h=80):
+        import numpy as np
+        from PIL import Image
+
+        return Image.fromarray(np.full((h, w, 3), 128, dtype=np.uint8))
+
+    def test_falls_back_to_lanczos_when_extra_absent(self, monkeypatch):
+        import numpy as np
+        from PIL import Image
+
+        from remove_ai_watermarks import upscaler
+
+        monkeypatch.setattr(upscaler, "is_available", lambda: False)
+        img = self._pil()
+        out = InvisibleEngine._esrgan_upscale(self._fake_engine(), img, (1024, 683))
+        assert out.size == (1024, 683)
+        # Identical to a plain Lanczos resize (the fallback path).
+        assert np.array_equal(np.asarray(out), np.asarray(img.resize((1024, 683), Image.Resampling.LANCZOS)))
+
+    def test_resizes_esrgan_output_to_exact_target(self, monkeypatch):
+        import cv2
+
+        from remove_ai_watermarks import upscaler
+
+        monkeypatch.setattr(upscaler, "is_available", lambda: True)
+
+        # Fake a 2x upscale that does NOT match the requested target; the helper must
+        # resize it to the exact target.
+        def _fake_upscale(bgr, device=None):
+            return cv2.resize(bgr, (bgr.shape[1] * 2, bgr.shape[0] * 2), interpolation=cv2.INTER_NEAREST)
+
+        monkeypatch.setattr(upscaler, "upscale", _fake_upscale)
+        out = InvisibleEngine._esrgan_upscale(self._fake_engine(), self._pil(), (1024, 683))
+        assert out.size == (1024, 683)
+
+    def test_falls_back_to_lanczos_when_upscale_raises(self, monkeypatch):
+        import numpy as np
+        from PIL import Image
+
+        from remove_ai_watermarks import upscaler
+
+        monkeypatch.setattr(upscaler, "is_available", lambda: True)
+
+        def _boom(bgr, device=None):
+            raise RuntimeError("model exploded")
+
+        monkeypatch.setattr(upscaler, "upscale", _boom)
+        img = self._pil()
+        out = InvisibleEngine._esrgan_upscale(self._fake_engine(), img, (512, 341))
+        assert out.size == (512, 341)
+        assert np.array_equal(np.asarray(out), np.asarray(img.resize((512, 341), Image.Resampling.LANCZOS)))
@@ -0,0 +1,32 @@
+"""Tests for the optional Real-ESRGAN upscaler (no model download).
+
+The model-running path is exercised manually (it downloads ~67 MB of BSD-3-Clause
+weights on first use); these tests cover the availability guard and the no-model
+control flow, mirroring the repo convention for ML-adjacent modules.
+"""
+
+from __future__ import annotations
+
+import numpy as np
+import pytest
+
+from remove_ai_watermarks import upscaler
+
+
+class TestIsAvailable:
+    def test_returns_bool(self):
+        assert isinstance(upscaler.is_available(), bool)
+
+
+class TestUpscaleGuard:
+    def test_raises_without_extra(self, monkeypatch):
+        monkeypatch.setattr(upscaler, "is_available", lambda: False)
+        with pytest.raises(RuntimeError, match="esrgan"):
+            upscaler.upscale(np.full((32, 32, 3), 128, dtype=np.uint8))
+
+
+class TestModelCachePath:
+    def test_cache_path_uses_model_filename(self):
+        if not upscaler.is_available():
+            pytest.skip("esrgan extra (torch) not installed")
+        assert upscaler._model_cache_path().name == upscaler._MODEL_FILENAME
@@ -3075,6 +3075,9 @@ dev = [
    { name = "pytest-cov" },
    { name = "ruff" },
 ]
+esrgan = [
+    { name = "spandrel" },
+]
 gpu = [
    { name = "accelerate" },
    { name = "diffusers" },
@@ -3125,12 +3128,13 @@ requires-dist = [
    { name = "ruff", marker = "extra == 'dev'", specifier = ">=0.4.0" },
    { name = "safetensors", marker = "extra == 'gpu'" },
    { name = "scipy", marker = "extra == 'restore'", specifier = "<1.18" },
+    { name = "spandrel", marker = "extra == 'esrgan'", specifier = ">=0.3.0" },
    { name = "tokenizers", marker = "extra == 'gpu'", specifier = ">=0.22,<0.23" },
    { name = "torch", marker = "extra == 'gpu'", specifier = ">=2.0.0" },
    { name = "transformers", marker = "extra == 'gpu'", specifier = ">=5,<6" },
    { name = "trustmark", marker = "extra == 'trustmark'", specifier = ">=0.8.0" },
 ]
-provides-extras = ["gpu", "detect", "trustmark", "lama", "restore", "dev", "all"]
+provides-extras = ["gpu", "detect", "trustmark", "lama", "restore", "esrgan", "dev", "all"]

 [[package]]
 name = "requests"
@@ -3494,6 +3498,23 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/b7/ce/149a00dd41f10bc29e5921b496af8b574d8413afcd5e30dfa0ed46c2cc5e/six-1.17.0-py2.py3-none-any.whl", hash = "sha256:4721f391ed90541fddacab5acf947aa0d3dc7d27b2e1e8eda2be8970586c3274", size = 11050, upload-time = "2024-12-04T17:35:26.475Z" },
 ]

+[[package]]
+name = "spandrel"
+version = "0.4.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "einops" },
+    { name = "numpy" },
+    { name = "safetensors" },
+    { name = "torch" },
+    { name = "torchvision" },
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/2a/8f/ab4565c23dd67a036ab72101a830cebd7ca026b2fddf5771bbf6284f6228/spandrel-0.4.2.tar.gz", hash = "sha256:fefa4ea966c6a5b7721dcf24f3e2062a5a96a395c8bedcb570fb55971fdcbccb", size = 247544, upload-time = "2026-02-21T01:52:26.342Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/74/31/411ea965835534c43d4b98d451968354876e0e867ea1fd42669e4cca0732/spandrel-0.4.2-py3-none-any.whl", hash = "sha256:6c93e3ecbeb0e548fd2df45a605472b34c1614287c56b51bb33cdef7ae5235b5", size = 320811, upload-time = "2026-02-21T01:52:25.015Z" },
+]
+
 [[package]]
 name = "sympy"
 version = "1.14.0"