feat(invisible): sliding-window tiled diffusion for large inputs (--tile)

Add a lossless alternative to the --max-resolution downscale for large images that OOM on MPS/GPU: regenerate in overlapping, feather-blended tiles at native resolution.

- noai/tiling.py: pure plan_tiles (uniform tiles, last flush to edge) + feather_weights (strictly-positive separable taper -> partition-of-unity blend) + run_tiled (per-tile generate callable, decoupled from the pipeline). Unit-tested without the model.
- WatermarkRemover.remove_watermark: refactor _generate into _generate_one + a tiled branch that engages only when --tile is set and the long side exceeds tile_size (ControlNet canny is rebuilt per tile).
- Thread tile/tile_size/tile_overlap through InvisibleEngine and the invisible/all/batch CLI commands via a shared _tile_options decorator.

Verified end-to-end on the real SDXL pipeline (forced 2x2 tiling on a 1024px sample, MPS): non-degenerate output, no gross seam at tile borders.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-04 23:47:49 +02:00 · 2026-06-19 11:54:58 -07:00
parent d5845a72f3
commit 0c0c6c6b03
9 changed files with 395 additions and 10 deletions
@@ -18,13 +18,13 @@ Consequences for contributors (do not drift back into the stock niche just becau
 ## How to run

 - `uv run remove-ai-watermarks all <image.png> -o <output.png>` — full pipeline (visible + invisible + metadata). Same diffusion knobs as `invisible` below, plus the visible-pass `--inpaint/--no-inpaint`/`--inpaint-method`. **When the `[gpu]` extra is absent, step 2 (invisible/SynthID) is skipped** — `all` still writes an output (visible mark + metadata stripped) but prints a prominent end-of-run banner ("the invisible (SynthID) watermark was NOT removed") AND exits **non-zero** (1), so a skipped SynthID pass is not mistaken for a clean result (the recurring #14/#47 trap, where the old quiet inline warning was missed). `invisible` already hard-errors without the extra; only `all` continued, hence the loud end-banner. Regression-guarded by `tests/test_cli.py::TestAllCommand::test_all_loud_warning_and_nonzero_exit_when_gpu_missing`. **Test trap:** any `all` test that exercises the full pipeline MUST `patch("remove_ai_watermarks.invisible_engine.is_available", return_value=True)` — CI installs core+dev only (no `[gpu]`), so an unpatched `all` test takes the skip branch and now hits the non-zero exit. This passed locally (gpu present → `is_available()` True) but red-failed every matrix cell on the v0.11.0 commit (`test_all_basic`/`test_all_visible_step_uses_registry` asserted exit 0); both now patch `is_available` True.
- `uv run remove-ai-watermarks invisible <image.png> -o <out.png>` — diffusion SynthID removal. **Full knob set** (kept identical across `invisible`/`all`/`batch`): `--strength` (vendor-adaptive default), `--steps`, `--guidance-scale` (CFG, default 7.5), `--pipeline sdxl|controlnet` (default `controlnet`), `--controlnet-scale`, `--model` (HF model id, default SDXL base), `--device`, `--seed`, `--hf-token`, `--max-resolution`/`--min-resolution`, `--upscaler lanczos|esrgan`, `--humanize` (Analog Humanizer grain), `--unsharp` (final sharpen), and `--adaptive-polish/--no-adaptive-polish` (**ON by default**; detail-targeted polish that self-gates to a no-op where there is no deficit). `--auto` is deprecated and now a no-op that only warns (the polish it used to enable is ON by default).
+- `uv run remove-ai-watermarks invisible <image.png> -o <out.png>` — diffusion SynthID removal. **Full knob set** (kept identical across `invisible`/`all`/`batch`): `--strength` (vendor-adaptive default), `--steps`, `--guidance-scale` (CFG, default 7.5), `--pipeline sdxl|controlnet` (default `controlnet`), `--controlnet-scale`, `--model` (HF model id, default SDXL base), `--device`, `--seed`, `--hf-token`, `--max-resolution`/`--min-resolution`, `--upscaler lanczos|esrgan`, `--humanize` (Analog Humanizer grain), `--unsharp` (final sharpen), `--adaptive-polish/--no-adaptive-polish` (**ON by default**; detail-targeted polish that self-gates to a no-op where there is no deficit), and `--tile/--no-tile` + `--tile-size`/`--tile-overlap` (**OFF by default**; sliding-window tiled diffusion -- the *lossless* alternative to a `--max-resolution` downscale for large inputs that OOM on MPS/GPU. Engages only when the long side exceeds `--tile-size`, default 1024; tiles are feather-blended over `--tile-overlap` px, default 128. Pair with `--max-resolution 0`). `--auto` is deprecated and now a no-op that only warns (the polish it used to enable is ON by default).
 - `uv run remove-ai-watermarks visible <image.png> -o <out.png>` — known-visible-mark removal, CPU, no GPU. Reverse-alpha based: each mark is removed by inverting its captured alpha map. `--mark auto` (default) picks the strongest detected of the Gemini sparkle, the Doubao "豆包AI生成" text strip, the Jimeng "★ 即梦AI" wordmark, and the Samsung Galaxy AI "✦ Contenuti generati dall'AI" strip (bottom-LEFT, locale-specific — Italian variant calibrated); `--mark gemini` / `--mark doubao` / `--mark jimeng` / `--mark samsung` force one (choices come from the registry). Gemini/Doubao recover pixels exactly with no inpaint at native; **Jimeng and Samsung add an always-on thin residual inpaint over the glyph footprint** (their marks re-rasterize per image, so reverse-alpha alone leaves a faint outline). For arbitrary logos/objects use `erase`. **When `--mark auto` finds no known mark (the common case — ~74% of real uploads carry no registered visible mark), the command does NOT silently re-serve the input as a finished result.** It runs a cheap metadata-only `identify`, prints actionable guidance (if the image carries an invisible/metadata mark, e.g. an OpenAI/Gemini C2PA image, it points to `all`; otherwise it does NOT imply the image is clean -- it warns that an invisible pixel watermark like SynthID cannot be detected once the metadata proxy is gone and routes to both `all` and `erase --region`), writes NO output file, and exits **`EXIT_NO_VISIBLE_MARK` (2)** — distinct from success (0) and a hard error (1) so a wrapping service (raiw.cc) can surface the message instead of treating the unchanged image as done (the production "it didn't work" / score-0 trap). Same handling for an explicit `--mark <name>` that is not detected. Helper `cli._no_visible_mark_exit`; regression-guarded by `tests/test_cli.py::TestVisibleCommand::test_visible_auto_no_mark_exits_two_with_eraser_hint` and `test_visible_auto_no_mark_routes_to_all_when_metadata`. 
`--no-detect` still forces the gemini fallback and proceeds (exit 0).
 - `uv run remove-ai-watermarks erase <image.png> --region x,y,w,h -o <out.png>` — universal region eraser (any logo/object, any position). `--backend cv2` (default, no deps) or `--backend lama` (big-LaMa via onnxruntime, extra `lama`); `--region` is repeatable.
 - `uv run remove-ai-watermarks identify <image>` — provenance verdict (platform + watermark inventory + confidence); `--json` for machine output, `--no-visible` to skip the cv2 sparkle detector
 - `uv run remove-ai-watermarks metadata <image.png> --check` — inspect AI metadata (C2PA, EXIF, PNG chunks)
 - `uv run remove-ai-watermarks metadata <image.png> --remove -o <out.png>` — strip all AI metadata
- `uv run remove-ai-watermarks batch <directory>` — process every supported image in a directory (output defaults to `<directory>_clean/`, set with `-o`). `--mode visible|invisible|metadata|all` (default `visible`); the invisible/all path reuses the **full `invisible` knob set above** (`--strength`/`--steps`/`--guidance-scale`/`--pipeline`/`--controlnet-scale`/`--model`/`--device`/`--max-resolution`/`--min-resolution`/`--upscaler`/`--seed`/`--hf-token`/`--humanize`/`--unsharp`/`--adaptive-polish`), plus `--inpaint/--no-inpaint` for the visible pass. `--adaptive-polish` is ON by default; `--auto` is deprecated and a no-op that only warns. One engine cached per pipeline; the polish is resolved once before the loop.
+- `uv run remove-ai-watermarks batch <directory>` — process every supported image in a directory (output defaults to `<directory>_clean/`, set with `-o`). `--mode visible|invisible|metadata|all` (default `visible`); the invisible/all path reuses the **full `invisible` knob set above** (`--strength`/`--steps`/`--guidance-scale`/`--pipeline`/`--controlnet-scale`/`--model`/`--device`/`--max-resolution`/`--min-resolution`/`--upscaler`/`--seed`/`--hf-token`/`--humanize`/`--unsharp`/`--adaptive-polish`/`--tile`/`--tile-size`/`--tile-overlap`), plus `--inpaint/--no-inpaint` for the visible pass. `--adaptive-polish` is ON by default; `--auto` is deprecated and a no-op that only warns. One engine cached per pipeline; the polish is resolved once before the loop.

 ## Test and lint

@@ -62,6 +62,7 @@ Compact map. The full per-module detail (design decisions, tuned thresholds, cal
 - `invisible_watermark.py` — decodes the OPEN DWT-DCT watermarks (SD / SDXL / FLUX) via `imwatermark` (extra `detect`, pulls torch). Fragile two ways: (1) does not survive JPEG re-encode/resize; (2) **carrier-fragile on a broad class of pristine images** -- a clean encode->decode round-trip recovers 48/48 on chatgpt/firefly/random but FAILS (28-39/48, below the `_MATCH_48`=44 gate) on the FLUX fox, doubao, a flat FLUX generation, AND a clean synthetic flat fill with no watermark. The failure does NOT track texture; it goes with a degenerate **all-ones decode that is a CARRIER ARTIFACT, not a watermark** (synthetic clean image reproduces it). So `detect_invisible_watermark` is **positive-only**: trust a hit; a `None` is inconclusive unless a same-carrier positive-control embed first recovers >=44. Verified 2026-06-19; full caveat in `docs/watermarking-landscape.md`.
 - `trustmark_detector.py` — Adobe TrustMark open decoder (extra `trustmark`). Do NOT remove the JPEG re-encode false-positive gate — a lone TrustMark hit without it is almost always content noise.
 - `noai/watermark_remover.py` — `WatermarkRemover` with two diffusion pipelines selected by the explicit `pipeline` ctor arg, never inferred from `model_id`: `sdxl` (plain SDXL img2img) and `controlnet` (SDXL + canny ControlNet, **the DEFAULT since 2026-06-09**). Removal comes from the img2img `strength`; ControlNet only preserves text/face STRUCTURE — SynthID CAN survive controlnet on photoreal content at low strength. No face-restore extra ships, by validated decision (every restore approach looked MORE AI-generated).
+- `noai/tiling.py` — sliding-window tiled diffusion for large inputs (CLI `--tile`). `WatermarkRemover.remove_watermark` branches to `run_tiled` when `tile` is set AND the long side exceeds `tile_size`, refactoring the single-pass `_generate` into a per-tile `_generate_one` (the ControlNet edge map is rebuilt per tile inside it). Pure helpers `plan_tiles` (uniform-size tiles, last one flush to the edge) and `feather_weights` (strictly-positive separable taper -> partition-of-unity blend) are unit-tested without the model. New tile-blend tuning goes in those pure helpers; do not inline blend math into the runner.
 - `auto_config.py` + the content-detection layer were REMOVED 2026-06-09; `--auto` is a deprecated no-op (controlnet is the default pipeline and the adaptive polish is ON by default and self-gates to a no-op where there is no detail deficit).
 - `upscaler.py` — optional Real-ESRGAN pre-diffusion super-resolution for small inputs (extra `esrgan`, spandrel only). Manual opt-in; the default `--upscaler` stays `lanczos` and the engine always falls back to Lanczos on absence/error. ESRGAN can degrade faces and thin text.
 - `image_io.py` — Unicode-safe cv2 IO (issue #17). Every cv2 file read/write in the package routes through `imread`/`imwrite`; do not call `cv2.imread`/`cv2.imwrite` directly. `to_bgr(image)` is the shared channel normalizer — use it instead of inlining `cvtColor` branches.
@@ -76,7 +77,7 @@ Who embeds what (C2PA / IPTC / EXIF / TC260 AIGC / xAI signature / open and prop

 Compact list. Full measurements, incident history, and oracle-validation runs live in `docs/known-limitations.md` — **read the relevant section there before changing the diffusion pipelines, strength defaults, resolution handling, or metadata coverage.**

- `invisible` processes at native resolution for inputs >= 1024px long side and auto-upscales smaller inputs to a 1024px floor (`--min-resolution 0` disables; `--max-resolution N` is an opt-in cap to bound GPU/MPS memory). MPS OOM is memory-tier dependent, not a hard limit: ~24 GB unified memory falls back to CPU (slow but weight-identical output), 32 GB runs native on MPS. The native-vs-cap-vs-floor decision lives in the pure helper `invisible_engine._target_size` — keep the logic there, unit-tested without the model.
+- `invisible` processes at native resolution for inputs >= 1024px long side and auto-upscales smaller inputs to a 1024px floor (`--min-resolution 0` disables; `--max-resolution N` is an opt-in cap to bound GPU/MPS memory). MPS OOM is memory-tier dependent, not a hard limit: ~24 GB unified memory falls back to CPU (slow but weight-identical output), 32 GB runs native on MPS. The native-vs-cap-vs-floor decision lives in the pure helper `invisible_engine._target_size` — keep the logic there, unit-tested without the model. For large inputs that OOM, `--tile` is the **lossless** alternative to `--max-resolution`: sliding-window diffusion at native resolution, each tile near SDXL's 1024 training size, feather-blended over the overlap (`noai/tiling.py`). It only engages when the long side exceeds `--tile-size`; the geometry (`plan_tiles`) and the blend window (`feather_weights`) are pure and unit-tested (`tests/test_tiling.py`). Caveat: each tile is an independent low-strength regeneration, so at the certified removal strengths (0.20-0.30) tile drift is minimal but not zero; tiling is a memory workaround, not a quality upgrade over a single native pass.
 - fp16 VAE black-output (issues #29/#41): the fp16-fixed SDXL VAE (`madebyollin/sdxl-vae-fp16-fix`) is swapped in for the default SDXL checkpoint on cuda/xpu fp16, plus a model-agnostic backstop that detects a degenerate (all-black) fp16 output and re-runs once in fp32. cpu/mps run fp32 and never reproduce the bug.
 - Pyright first run is slow (2-3 min) due to ML deps (torch/diffusers/transformers stubs); full-project `uv run pyright` can stall for many minutes — scope it to changed files.
 - A third-party PIL plugin autoload (e.g. an HEIF/AVIF plugin) can raise a non-OSError (`ModuleNotFoundError`), not `UnidentifiedImageError`, when opening a file. Code that opens user-supplied or unknown-format files should `except Exception`, not just `OSError`/`UnidentifiedImageError`.
@@ -133,7 +133,7 @@ image → encode to latent space (VAE) at native resolution
      → decode back to pixels (VAE)
 ```

- Large inputs run at native resolution (no down-then-up round-trip, which was the main quality loss in issue #10); use `--max-resolution N` only to cap GPU/MPS memory on very large inputs. Small inputs (long side under 1024 px) are auto-upscaled to a 1024 px floor before diffusion, because SDXL distorts on a tiny latent, and the result is restored to the original size (a transparent quality boost). Disable the floor with `--min-resolution 0`. The floor upscale uses Lanczos by default; `--upscaler esrgan` (the `esrgan` extra) runs Real-ESRGAN first for sharper detail and falls back to Lanczos if the extra is absent. ESRGAN is a generic photo/texture GAN with no face/glyph prior, so it is best for photo/texture content -- it can degrade faces (the diffusion pass regenerates them, so the final recovers) and thin text; keep Lanczos for text-heavy inputs.
+- Large inputs run at native resolution (no down-then-up round-trip, which was the main quality loss in issue #10); use `--max-resolution N` only to cap GPU/MPS memory on very large inputs. For inputs that run out of GPU/MPS memory at native resolution, `--tile` is the lossless alternative to `--max-resolution`: it regenerates the image in overlapping, feather-blended tiles (each near SDXL's 1024 px size) so there is no downscale and no visible seam. It engages only when the long side exceeds `--tile-size` (default 1024; overlap `--tile-overlap`, default 128); pair it with `--max-resolution 0`. Small inputs (long side under 1024 px) are auto-upscaled to a 1024 px floor before diffusion, because SDXL distorts on a tiny latent, and the result is restored to the original size (a transparent quality boost). Disable the floor with `--min-resolution 0`. The floor upscale uses Lanczos by default; `--upscaler esrgan` (the `esrgan` extra) runs Real-ESRGAN first for sharper detail and falls back to Lanczos if the extra is absent. ESRGAN is a generic photo/texture GAN with no face/glyph prior, so it is best for photo/texture content -- it can degrade faces (the diffusion pass regenerates them, so the final recovers) and thin text; keep Lanczos for text-heavy inputs.

 > **Default strength is vendor-adaptive (no flag needed).** The tool reads the C2PA issuer to detect which vendor's SynthID is present and picks the strength accordingly: **OpenAI gpt-image → `0.20`**, **Google Gemini → `0.30`**, **unknown source → `0.30`**. The **same ladder applies to both pipelines** — these are the oracle-certified `controlnet` floors (June 2026 Modal cert, multi-seed). They also cover plain `sdxl`: the two pipelines have opposite hard cases (controlnet leaves SynthID on photoreal, sdxl on flat graphics), but on its own hard case sdxl is the weaker remover, so it needs at least controlnet's strength — using one certified ladder is the safe choice (margin-based for sdxl, not separately certified). The dominant factor is the vendor (Google's SynthID is ~3x more robust). There is no local SynthID detector, so if the oracle still reads SynthID, raise `--strength`; if you care more about preserving fine detail, lower it. (Caveat: Google's `0.30` was validated only at `--max-resolution 1536`; a very large native Gemini image may need ~`0.35`+.)
 >
@@ -354,7 +354,8 @@ remove-ai-watermarks invisible image.png -o clean.png --humanize 4.0 --unsharp 0
 # Large images run at native resolution; small ones are upscaled to a 1024 floor
 # first (disable with --min-resolution 0); --upscaler esrgan uses Real-ESRGAN for
 # that floor upscale (needs the 'esrgan' extra). On a very large image that OOMs the
-# GPU/MPS, cap the long side: --max-resolution 2048
+# GPU/MPS, either cap the long side (--max-resolution 2048, lossy) or pass --tile
+# to regenerate in overlapping feather-blended tiles at native resolution (lossless).
 # Strength is vendor-adaptive by default (OpenAI 0.20 / Google 0.30, same
 # for both pipelines); override with --strength. controlnet (text/face
 # structure preservation) is the default pipeline; --pipeline sdxl opts down
@@ -482,7 +483,7 @@ Won't fix:

 - **Visible-mark and metadata removal is lossless.** Reverse-alpha recovers the original pixels under the mark; metadata stripping never touches image data.
 - **The invisible (SynthID) path is lossy and not guaranteed.** It runs a low-strength SDXL img2img regeneration, so it softens fine detail and is content-dependent. There is no public SynthID decoder, so the tool cannot verify removal locally; confirm with the Gemini app's "Verify with SynthID" oracle and raise `--strength` if it still detects. A vendor can change the scheme at any time, so treat this as an arms race, not a permanent fix.
- **Large images are not tiled yet.** The SynthID path runs at the diffusion model's native resolution and caps or upscales larger inputs instead of tiling them, so very large images lose detail. (Nano Banana 2 is natively 1024px; GPT Image 2 supports 4K experimentally.) Tiled diffusion is the planned fix.
+- **Large images: native by default, opt-in tiling for OOM.** The SynthID path runs at the diffusion model's native resolution; on a memory-constrained GPU/MPS you can either cap the long side with `--max-resolution` (lossy downscale) or pass `--tile` to regenerate in overlapping, feather-blended tiles at native resolution (lossless, no seam). Tiling is a memory workaround, not a quality upgrade over a single native pass: each tile is an independent low-strength regeneration. (Nano Banana 2 is natively 1024px; GPT Image 2 supports 4K experimentally.)
 - **Out of scope:** defeating trained AI-vs-real classifiers like Hive (see [Threat model](#threat-model)), visible-logo removal from video, and any guarantee that a stripped copy is untraceable server-side.

 ## Legal
@@ -13,7 +13,16 @@ pipelines, strength defaults, or metadata coverage.

 `invisible` pipeline processes at **native resolution for inputs whose long side is >= 1024px**, and **auto-upscales smaller inputs UP to a 1024px floor** (`min_resolution=1024`, the default; `--min-resolution 0` disables) before diffusion -- SDXL img2img distorts badly on a tiny latent (a 381x512 portrait wrecks at native, the #36 follow-up), and the output is restored to the original input size so the floor is a transparent quality boost (it adds time/memory on small inputs). The floor upscale uses Lanczos by default; **`--upscaler esrgan`** (opt-in, the `esrgan` extra) runs Real-ESRGAN first for better detail before the Lanczos resize to the exact target (`upscaler.py` / `InvisibleEngine._esrgan_upscale`, falls back to Lanczos if the extra is absent). `max_resolution=0` (default) means no downscale cap, matching the hosted raiw.cc backend (fal fast-sdxl, no pre-downscale). The old forced downscale-to-1024 -> upscale-back round-trip for LARGE images was the main quality loss (issue #10) and is gone; at strength ~0.05 SDXL img2img does not need a downscale.

-**Final `--unsharp` post-filter (`humanizer.unsharp_mask`, opt-in, default 0):** applied LAST (after the face-restore pass, else it would be smoothed over) to counter the soft/over-smoothed look diffusion + restoration leave (an AI tell); ~0.5-0.8 safe, higher risks halos. Pairs with `--humanize` (grain adds sensor-noise texture, unsharp adds crispness). `--max-resolution N` re-introduces an opt-in long-side cap purely to bound GPU/MPS memory on very large inputs (it reintroduces the lossy round-trip). For huge images that OOM at native, tile-based diffusion is still the proper long-term fix.
+**Final `--unsharp` post-filter (`humanizer.unsharp_mask`, opt-in, default 0):** applied LAST (after the face-restore pass, else it would be smoothed over) to counter the soft/over-smoothed look diffusion + restoration leave (an AI tell); ~0.5-0.8 safe, higher risks halos. Pairs with `--humanize` (grain adds sensor-noise texture, unsharp adds crispness). `--max-resolution N` re-introduces an opt-in long-side cap purely to bound GPU/MPS memory on very large inputs (it reintroduces the lossy round-trip). For huge images that OOM at native, **`--tile` is the lossless alternative** -- see the tiled-diffusion subsection below.
+
+### Tiled diffusion for large inputs (`--tile`, issue #10)
+
+`--tile` (OFF by default; `--tile-size` default 1024, `--tile-overlap` default 128) processes the diffusion pass in overlapping sliding-window tiles instead of one forward pass, so a large image is regenerated at **native resolution** without the OOM and without the lossy `--max-resolution` downscale round-trip. It engages only when the long side exceeds `--tile-size`; a sub-tile image runs a single pass unchanged. `WatermarkRemover.remove_watermark` refactors the single-image `_generate` into a per-tile `_generate_one` (the ControlNet canny edge map is rebuilt per tile, so structure preservation works tile-local) and routes it through `noai.tiling.run_tiled` when tiling is active. The geometry and blend math are pure helpers, unit-tested without the model (`tests/test_tiling.py`):
+
+- `plan_tiles(w, h, tile_size, overlap)` lays out a row-major grid where every tile is exactly `tile_size` (the last tile on each axis is pulled back flush to the far edge, simply overlapping its predecessor more). Uniform tile size keeps each diffusion pass at SDXL's preferred dimension.
+- `feather_weights(w, h, overlap)` is a separable linear taper, ~1 in the interior and ramping toward each edge, kept **strictly positive** so the normalised accumulate-and-divide blend (`accum / weight_sum`) is a partition of unity: a region covered by one feathered edge (an image corner) still divides cleanly. Identical (unchanged) tiles therefore reconstruct the input exactly -- the seam-free guarantee, asserted in `test_identity_generate_reconstructs_image`.
+
+CAVEAT: each tile is an **independent** low-strength regeneration. At the certified removal strengths (0.20-0.30) the per-tile drift is small and the feather blend hides the seams, but tiling is a memory workaround, not a quality upgrade over a single native pass -- a 32 GB MPS box that clears the native UNet peak should prefer no tiling. The MPS->CPU fallback still applies per tile; if the first tile falls back to CPU, the device stays CPU for the rest of the image.
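
The tile geometry described in this subsection can be sketched in a few lines. This is a minimal illustration and not the code in `noai/tiling.py`: the names `sketch_axis_positions` and `sketch_plan_tiles` are hypothetical, tiles are plain `(x0, y0, x1, y1)` tuples rather than the real `Tile` type, and shrinking the tile on an axis shorter than `tile_size` is an assumption about how such inputs might be handled.

```python
def sketch_axis_positions(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of equal-size tiles along one axis (hypothetical helper)."""
    if length <= tile:
        return [0]  # the axis fits in a single tile
    overlap = min(overlap, tile - 1)  # clamp so the step stays >= 1
    step = tile - overlap
    positions = list(range(0, length - tile, step))
    positions.append(length - tile)  # last tile pulled back flush to the far edge
    return positions


def sketch_plan_tiles(
    width: int, height: int, tile_size: int, overlap: int
) -> list[tuple[int, int, int, int]]:
    """Row-major grid of (x0, y0, x1, y1) boxes, all the same size."""
    # Assumption: on an axis shorter than tile_size the tile shrinks to the image.
    tile_w, tile_h = min(tile_size, width), min(tile_size, height)
    return [
        (x, y, x + tile_w, y + tile_h)
        for y in sketch_axis_positions(height, tile_h, overlap)
        for x in sketch_axis_positions(width, tile_w, overlap)
    ]
```

With the documented defaults (tile 1024, overlap 128), a 2048 px axis gives start offsets 0, 896 and 1024 in this sketch: the final tile overlaps its predecessor by far more than 128 px, which is the "pulled back flush to the far edge" behaviour.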

 **Concrete MPS data points (the OOM is memory-tier-dependent, NOT a hard MPS limit):** on a ~24 GB unified-memory machine (verified 2026-05-25, 1254x1254 gpt-image SDXL, fp32) native res OOMs at the *UNet* step (peak ~17 GiB), not only the VAE decode, and the auto-fallback in `img2img_runner` reloads on CPU and finishes (slow, ~13 min) -- the output is still weight-identical and defeats SynthID, so "looks hung/crashed" on Mac is usually this CPU fallback, not a pipeline error. On a **32 GB** unified-memory machine the same default SDXL pass runs entirely on MPS with **no CPU fallback** (verified 2026-05-31, 1122x1402 gpt-image, `all`/default, ~155 s end-to-end), so 32 GB clears the native-res UNet peak that 24 GB could not. Adding `enable_vae_tiling()` alone does NOT prevent the 24 GB OOM (the peak is the UNet, not the VAE). The fast Mac workarounds for memory-constrained machines are fp16 on MPS (roughly halves memory) or `--max-resolution` to cap the long side; neither is wired as the default. The `controlnet` pipeline adds the canny ControlNet weights on top of SDXL, so its peak is a bit higher than the plain `default` pass; the same MPS->CPU fallback covers an OOM. The native-vs-cap-vs-floor decision lives in the pure helper `invisible_engine._target_size(w, h, max_resolution, min_resolution)` (returns `None` for native, a target tuple for a downscale cap OR an upscale floor; cap takes precedence, the floor is skipped on a min>max misconfig) so it is unit-tested (`tests/test_invisible_engine.py::TestTargetSize`, the #10/#15/#36 regression guard) without loading the model -- keep that logic in the helper, don't re-inline it.

@@ -191,6 +191,12 @@ At the shared low removal strength the canny edge-conditioning keeps the regener

 **But the reverse also holds: a flat-graphic logo/poster SURVIVED `default` while clearing controlnet** -- removal at the low strength is content×pipeline dependent and neither pipeline is universally safe; the real lever is a higher strength. See the controlnet Known-limitations bullet for the full table + root cause. Canny holds face STRUCTURE but NOT identity (the regenerated face drifts in likeness -- canny carries edges, not identity). The drifted cleaned face is the LEAST-AI state we can reach without re-introducing SynthID; the library does NOT ship a face-restore extra. Every restore approach we evaluated (GFPGAN-on-cleaned, PhotoMaker-V2 txt2img, InstantID txt2img, InstantID img2img-on-cleaned at three parameter sweeps, 2026-06-04 - 2026-06-08 Modal cert sweeps) regenerated the face from an ArcFace embedding via SDXL diffusion -- which makes the output face look MORE AI-generated, not less. Empirical conclusion in `docs/synthid-robust-identity-research-2026-06-08.md` "Empirical follow-up". For production face preservation, ship the cleaned image as-is. `controlnet_conditioning_scale` (ctor arg, default 1.0) is the structure-preservation knob. Same dtype rule as `default` (fp32 on cpu/mps, fp16 only on cuda/xpu; the fp16-fixed SDXL VAE `_SDXL_FP16_VAE_ID` is swapped in on fp16 GPUs -- issue #29) and the same MPS->CPU fallback (reload on cpu/fp32, drop a non-cpu generator, retry once).

+**Tiled diffusion (`tile`/`tile_size`/`tile_overlap` ctor-path args, CLI `--tile`, issue #10):** for large inputs that OOM at native resolution, `remove_watermark` can process the diffusion pass in overlapping sliding-window tiles instead of one forward pass — the lossless alternative to a `--max-resolution` downscale. The single-image generation closure was refactored into `_generate_one(img)` (dispatches controlnet/img2img, generator shared so the seed advances deterministically across tiles), and `_generate()` routes it through `noai.tiling.run_tiled` when `tile` is set AND `max(init_image.size) > tile_size` (a sub-tile image runs one pass unchanged). The ControlNet canny edge map is rebuilt per tile inside `_generate_one`, so structure preservation is tile-local. See `noai/tiling.py` below and the tiled-diffusion subsection in `docs/known-limitations.md` for the geometry, the partition-of-unity blend, and the quality caveat.
+
+## `noai/tiling.py`
+
+Pure sliding-window tiling for the diffusion path (no torch import; numpy/PIL only). `plan_tiles(w, h, tile_size, overlap)` returns a row-major grid of uniform-size `Tile` boxes — every tile is exactly `tile_size` (the SDXL training size), with the last tile on each axis pulled back flush to the far edge (`_axis_positions` clamps a pathological `overlap >= tile` to `tile - 1` so the step stays >= 1). `feather_weights(w, h, overlap)` is a separable linear taper (1 in the interior, ramping toward each edge) floored at `_WEIGHT_EPS` so it is **strictly positive everywhere** — that makes the normalised `accum / weight_sum` blend a partition of unity, so identical/unchanged tiles reconstruct the input exactly (the seam-free guarantee). `run_tiled(generate_tile, image, tile_size, overlap, set_progress)` is the orchestration loop: crop each planned tile, call `generate_tile` (one diffusion pass on a single PIL tile — injected, so this stays decoupled from the pipeline), resize a latent-grid-rounded result back to the exact tile size, and feather-accumulate. All three are unit-tested without the model (`tests/test_tiling.py`: axis math, grid coverage, taper shape/symmetry/positivity, identity reconstruction, per-tile call count, and the resize-back path). New blend tuning belongs in these pure helpers, not inlined into the runner.
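
The partition-of-unity argument above can be checked on a single axis, since the taper is separable. The following is a sketch under stated assumptions, not the module's code: `taper`, `blend_1d`, and the value of `WEIGHT_EPS` are hypothetical stand-ins (the real `_WEIGHT_EPS` value is not shown here), and the exact ramp shape is an assumption.

```python
import math

WEIGHT_EPS = 1e-3  # assumed floor; stands in for the module's _WEIGHT_EPS


def taper(n: int, overlap: int) -> list[float]:
    """Linear ramp toward both ends, 1.0 in the interior, strictly positive."""
    ramp = min(overlap, n // 2)
    weights = [1.0] * n
    for i in range(ramp):
        value = max((i + 1) / (ramp + 1), WEIGHT_EPS)
        weights[i] = value
        weights[n - 1 - i] = value
    return weights


def blend_1d(signal, starts, tile, overlap, generate=lambda t: t):
    """Accumulate weighted tiles, then divide by the summed weights."""
    accum = [0.0] * len(signal)
    weight_sum = [0.0] * len(signal)
    weights = taper(tile, overlap)
    for start in starts:
        out = generate(signal[start:start + tile])  # one "diffusion pass" per tile
        for i, (value, w) in enumerate(zip(out, weights)):
            accum[start + i] += value * w
            weight_sum[start + i] += w
    # Every covered sample has weight_sum > 0, so the division is always defined.
    return [a / s for a, s in zip(accum, weight_sum)]


signal = [float(i * i % 7) for i in range(20)]
blended = blend_1d(signal, starts=[0, 6, 12], tile=8, overlap=2)
assert all(math.isclose(a, b) for a, b in zip(blended, signal))
```

Because each sample is divided by the sum of the weights that touched it, an identity `generate` returns the input up to floating-point rounding no matter how the tiles overlap. With a real per-tile regeneration, the same division turns the overlap into a weighted cross-fade between neighbouring tiles.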
+
 ## `auto_config.py` (REMOVED 2026-06-09)

 **`auto_config.py` + the content-detection layer were REMOVED 2026-06-09.**
@@ -189,6 +189,33 @@ _adaptive_polish_option = click.option(
    "no-op there. Pass --no-adaptive-polish to disable. Independent of --unsharp/--humanize.",
 )

+
+# Tiled-diffusion knobs, shared by the diffusion commands (invisible/all/batch).
+# Tiling is the lossless alternative to --max-resolution for large inputs that OOM
+# on MPS/GPU: process at native resolution in overlapping, feather-blended tiles.
+def _tile_options(f: Any) -> Any:
+    """Apply the --tile / --tile-size / --tile-overlap options to a command."""
+    f = click.option(
+        "--tile-overlap",
+        type=int,
+        default=128,
+        help="Overlap between adjacent tiles in px (feather-blended, no seam). Default 128.",
+    )(f)
+    f = click.option(
+        "--tile-size",
+        type=int,
+        default=1024,
+        help="Tile dimension in px for --tile (SDXL's training size). Default 1024.",
+    )(f)
+    return click.option(
+        "--tile/--no-tile",
+        default=False,
+        help="Process large images in overlapping tiles instead of one forward pass -- the lossless "
+        "alternative to --max-resolution for inputs that OOM on MPS/GPU. Engages only when the long "
+        "side exceeds --tile-size; pair with --max-resolution 0 (default) to keep native resolution. Default off.",
+    )(f)
+
+
 # HuggingFace model + CFG knobs, shared by the diffusion commands (invisible/all/batch)
 # so the surface stays identical across them.
 _model_option = click.option(
@@ -668,6 +695,7 @@ def cmd_erase(
@_guidance_scale_option
@_auto_option
@_adaptive_polish_option
+@_tile_options
@click.pass_context
 def cmd_invisible(
    ctx: click.Context,
@@ -689,6 +717,9 @@ def cmd_invisible(
    guidance_scale: float | None,
    auto: bool,
    adaptive_polish: bool,
+    tile: bool,
+    tile_size: int,
+    tile_overlap: int,
 ) -> None:
    """Remove invisible AI watermarks (SynthID, StableSignature, TreeRing).

@@ -747,6 +778,9 @@ def cmd_invisible(
        min_resolution=min_resolution,
        upscaler=upscaler,
        vendor=vendor,
+        tile=tile,
+        tile_size=tile_size,
+        tile_overlap=tile_overlap,
    )
    elapsed = time.monotonic() - t0

@@ -917,6 +951,7 @@ def cmd_identify(ctx: click.Context, source: Path, no_visible: bool, as_json: bo
@_guidance_scale_option
@_auto_option
@_adaptive_polish_option
+@_tile_options
@click.pass_context
 def cmd_all(
    ctx: click.Context,
@@ -940,6 +975,9 @@ def cmd_all(
    guidance_scale: float | None,
    auto: bool,
    adaptive_polish: bool,
+    tile: bool,
+    tile_size: int,
+    tile_overlap: int,
 ) -> None:
    """Remove ALL watermarks: visible + invisible + metadata.

@@ -1044,6 +1082,9 @@ def cmd_all(
                min_resolution=min_resolution,
                upscaler=upscaler,
                vendor=vendor,
+                tile=tile,
+                tile_size=tile_size,
+                tile_overlap=tile_overlap,
            )
            console.print("    Invisible watermark removed")

@@ -1121,6 +1162,9 @@ def _process_batch_image(
    model: str | None = None,
    guidance_scale: float | None = None,
    adaptive_polish: bool = False,
+    tile: bool = False,
+    tile_size: int = 1024,
+    tile_overlap: int = 128,
 ) -> None:
    """Process a single image for batch mode.

@@ -1179,6 +1223,9 @@ def _process_batch_image(
                max_resolution=max_resolution,
                min_resolution=min_resolution,
                upscaler=upscaler,
+                tile=tile,
+                tile_size=tile_size,
+                tile_overlap=tile_overlap,
                # Detect the vendor from the pristine original (`img_path`), not the
                # visible-processed `out_path` whose C2PA is already gone.
                vendor=vendor_for_strength(img_path),
@@ -1238,6 +1285,7 @@ def _process_batch_image(
@_guidance_scale_option
@_auto_option
@_adaptive_polish_option
+@_tile_options
@click.pass_context
 def cmd_batch(
    ctx: click.Context,
@@ -1261,6 +1309,9 @@ def cmd_batch(
    guidance_scale: float | None,
    auto: bool,
    adaptive_polish: bool,
+    tile: bool,
+    tile_size: int,
+    tile_overlap: int,
 ) -> None:
    """Process all images in a directory."""
    _banner()
@@ -1321,6 +1372,9 @@ def cmd_batch(
                    model=model,
                    guidance_scale=guidance_scale,
                    adaptive_polish=adaptive_polish,
+                    tile=tile,
+                    tile_size=tile_size,
+                    tile_overlap=tile_overlap,
                )
                processed += 1

@@ -170,6 +170,9 @@ class InvisibleEngine:
        unsharp: float = 0.0,
        adaptive_polish: bool = False,
        upscaler: str = "lanczos",
+        tile: bool = False,
+        tile_size: int = 1024,
+        tile_overlap: int = 128,
    ) -> Path:
        """Remove invisible watermark from an image.

@@ -205,6 +208,13 @@ class InvisibleEngine:
                via the ``esrgan`` extra). Only applies when UPscaling (the floor
                case); a ``max_resolution`` downscale always uses Lanczos. Falls back
                to Lanczos if the extra is absent.
+            tile: Process the diffusion pass in overlapping tiles instead of one
+                forward pass -- the lossless alternative to ``max_resolution`` for
+                large inputs that OOM on MPS/GPU. Engages only when the long side
+                exceeds ``tile_size``. Pair with ``max_resolution=0`` (the default)
+                so the input keeps its native resolution.
+            tile_size: Tile dimension in px (default 1024).
+            tile_overlap: Overlap between adjacent tiles in px (default 128).

        Returns:
            Path to the cleaned image.
@@ -261,6 +271,9 @@ class InvisibleEngine:
                guidance_scale=guidance_scale,
                seed=seed,
                vendor=vendor,
+                tile=tile,
+                tile_size=tile_size,
+                tile_overlap=tile_overlap,
            )

            # Post-processing chain: decode the diffusion output ONCE, apply the
@@ -0,0 +1,142 @@
+"""Sliding-window tiled diffusion for large images.
+
+The img2img / ControlNet pipeline denoises the WHOLE image in one forward pass,
+so it OOMs on MPS/GPU above ~2K (issue #10). Tiling splits the image into
+overlapping tiles -- each kept near SDXL's ~1024 training size -- regenerates
+each tile independently, and feather-blends the overlaps. The result is processed
+at NATIVE resolution with no seam: the lossless alternative to the
+``--max-resolution`` downscale (which trades quality for a smaller forward pass).
+
+The geometry (``plan_tiles``) and the blend weighting (``feather_weights``) are
+pure functions, unit-tested without the diffusion model. ``run_tiled`` is the
+orchestration loop; it takes a ``generate_tile`` callable (one img2img/ControlNet
+pass on a single PIL tile) so it stays decoupled from the pipeline internals.
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import TYPE_CHECKING, Any, NamedTuple
+
+if TYPE_CHECKING:
+    from collections.abc import Callable
+
+    from numpy.typing import NDArray
+    from PIL import Image as PILImage
+
+logger = logging.getLogger(__name__)
+
+# Strictly-positive floor for the accumulated blend weights so a region covered
+# by a single feathered tile edge (an image corner, no neighbour to blend with)
+# never divides by zero.
+_WEIGHT_EPS = 1e-3
+
+
+class Tile(NamedTuple):
+    """A tile crop box in the source image: top-left ``(x, y)`` + ``width``/``height``."""
+
+    x: int
+    y: int
+    width: int
+    height: int
+
+
+def _axis_positions(length: int, tile: int, overlap: int) -> list[int]:
+    """Tile start offsets along one axis, last tile flush to the far edge.
+
+    Every interior tile is exactly ``tile`` long; the final tile is pulled back
+    to ``length - tile`` so it ends exactly at the edge (it simply overlaps its
+    predecessor a little more). Keeping all tiles the same size is what lets the
+    diffusion pass run at SDXL's preferred dimension on every tile.
+    """
+    if tile <= 0:
+        raise ValueError(f"tile must be positive, got {tile}")
+    if length <= tile:
+        return [0]
+    # Guarantee forward progress even on a pathological overlap >= tile.
+    overlap = min(max(overlap, 0), tile - 1)
+    step = tile - overlap
+    positions = list(range(0, length - tile + 1, step))
+    last = length - tile
+    if positions[-1] != last:
+        positions.append(last)
+    return positions
+
+
+def plan_tiles(width: int, height: int, tile_size: int, overlap: int) -> list[Tile]:
+    """Lay out a grid of overlapping tiles covering ``width`` x ``height``.
+
+    All tiles are ``min(tile_size, width)`` x ``min(tile_size, height)`` (uniform
+    size; the image itself when it fits in one tile). Returned in row-major order.
+    """
+    xs = _axis_positions(width, tile_size, overlap)
+    ys = _axis_positions(height, tile_size, overlap)
+    tile_w = min(tile_size, width)
+    tile_h = min(tile_size, height)
+    return [Tile(x, y, tile_w, tile_h) for y in ys for x in xs]
+
+
+def feather_weights(width: int, height: int, overlap: int) -> NDArray[Any]:
+    """A 2D blend window: ~1 in the interior, ramping down toward each edge.
+
+    Separable linear taper over ``overlap`` pixels from every edge (capped at
+    half the tile so short tiles still taper symmetrically). Strictly positive
+    everywhere, so the normalised blend is well-defined even at an image corner
+    that only one tile covers.
+    """
+    import numpy as np
+
+    def ramp(n: int) -> NDArray[Any]:
+        w = np.ones(n, dtype=np.float32)
+        if overlap > 0 and n > 1:
+            ramp_len = min(overlap, max(1, n // 2))
+            taper = (np.arange(ramp_len, dtype=np.float32) + 1.0) / (ramp_len + 1.0)
+            w[:ramp_len] = taper
+            w[n - ramp_len :] = taper[::-1]
+        return w
+
+    weights = np.outer(ramp(height), ramp(width))
+    np.maximum(weights, _WEIGHT_EPS, out=weights)
+    return weights
+
+
+def run_tiled(
+    generate_tile: Callable[[PILImage.Image], PILImage.Image],
+    image: PILImage.Image,
+    tile_size: int,
+    overlap: int,
+    set_progress: Callable[[str], None] | None = None,
+) -> PILImage.Image:
+    """Tile ``image``, run ``generate_tile`` per tile, and feather-blend the result.
+
+    ``generate_tile`` runs one diffusion pass on a single RGB PIL tile and returns
+    the regenerated tile (the ControlNet control image is built per tile inside it,
+    so each tile gets its own edge map). A pass that rounds dimensions to the latent
+    grid is resized back to the exact tile size before blending.
+    """
+    import numpy as np
+    from PIL import Image
+
+    width, height = image.size
+    tiles = plan_tiles(width, height, tile_size, overlap)
+    accum = np.zeros((height, width, 3), dtype=np.float32)
+    weight_sum = np.zeros((height, width, 1), dtype=np.float32)
+
+    # All tiles share one size (plan_tiles is uniform), so the feather window is
+    # loop-invariant -- compute it once.
+    weights = feather_weights(tiles[0].width, tiles[0].height, overlap)[:, :, None]
+
+    total = len(tiles)
+    for index, tile in enumerate(tiles, start=1):
+        if set_progress is not None:
+            set_progress(f"Tiled diffusion: tile {index}/{total} at ({tile.x},{tile.y}) {tile.width}x{tile.height}...")
+        crop = image.crop((tile.x, tile.y, tile.x + tile.width, tile.y + tile.height))
+        result = generate_tile(crop)
+        if result.size != (tile.width, tile.height):
+            result = result.resize((tile.width, tile.height), Image.Resampling.LANCZOS)
+        arr = np.asarray(result.convert("RGB"), dtype=np.float32)
+        accum[tile.y : tile.y + tile.height, tile.x : tile.x + tile.width] += arr * weights
+        weight_sum[tile.y : tile.y + tile.height, tile.x : tile.x + tile.width] += weights
+
+    blended = accum / np.maximum(weight_sum, _WEIGHT_EPS)
+    return Image.fromarray(np.clip(np.rint(blended), 0, 255).astype(np.uint8))
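The partition-of-unity argument behind the feather blend is easier to check in one dimension. The following is a minimal standalone sketch, not the project's module: `axis_positions`, `ramp` and `blend_1d` are illustrative names that mirror `_axis_positions`, the inner `ramp` of `feather_weights`, and the accumulate/normalise loop of `run_tiled`, rewritten in plain Python so it runs without numpy or PIL. It assumes an identity "generate" step (each tile is returned unchanged).

```python
# Illustrative 1D re-implementation; names are hypothetical, not the module's API.


def axis_positions(length: int, tile: int, overlap: int) -> list[int]:
    """Tile start offsets along one axis, with the last tile flush to the far edge."""
    if tile <= 0:
        raise ValueError(f"tile must be positive, got {tile}")
    if length <= tile:
        return [0]
    overlap = min(max(overlap, 0), tile - 1)  # keep the step >= 1
    step = tile - overlap
    positions = list(range(0, length - tile + 1, step))
    if positions[-1] != length - tile:
        positions.append(length - tile)
    return positions


def ramp(n: int, overlap: int) -> list[float]:
    """Linear taper: 1.0 in the interior, falling toward (but never reaching) 0 at the edges."""
    w = [1.0] * n
    if overlap > 0 and n > 1:
        ramp_len = min(overlap, max(1, n // 2))
        for i in range(ramp_len):
            value = (i + 1) / (ramp_len + 1)  # strictly positive by construction
            w[i] = value
            w[n - 1 - i] = value
    return w


def blend_1d(signal: list[float], tile: int, overlap: int) -> list[float]:
    """Feather-accumulate unchanged tiles, then divide by the summed weights."""
    n = len(signal)
    weights = ramp(min(tile, n), overlap)
    accum = [0.0] * n
    weight_sum = [0.0] * n
    for start in axis_positions(n, tile, overlap):
        for offset, weight in enumerate(weights):
            accum[start + offset] += signal[start + offset] * weight
            weight_sum[start + offset] += weight
    return [a / w for a, w in zip(accum, weight_sum)]


signal = [float((7 * i) % 256) for i in range(40)]
restored = blend_1d(signal, tile=16, overlap=4)
max_error = max(abs(a - b) for a, b in zip(signal, restored))
```

Because every sample is divided by the sum of exactly the weights that were applied to it, `max_error` stays at floating-point noise whatever the overlap. A visible seam can therefore only come from the tiles themselves disagreeing in the overlap region, not from the blend.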
@@ -485,6 +485,9 @@ class WatermarkRemover:
        guidance_scale: float | None = None,
        seed: int | None = None,
        vendor: str | None = None,
+        tile: bool = False,
+        tile_size: int = 1024,
+        tile_overlap: int = 128,
    ) -> Path:
        """Remove watermark from an image using regeneration attack.

@@ -501,6 +504,13 @@ class WatermarkRemover:
                input with ``watermark_profiles.vendor_for_strength`` before processing
                strips the metadata; the caller passes it down so display and execution
                agree.
+            tile: Process the image in overlapping tiles instead of one forward pass.
+                The lossless alternative to a ``--max-resolution`` downscale for large
+                inputs that OOM on MPS/GPU (issue #10). Only engages when the long side
+                exceeds ``tile_size``; smaller images run a single pass unchanged.
+            tile_size: Tile dimension in px (default 1024, SDXL's training size).
+            tile_overlap: Overlap between adjacent tiles in px (default 128), feather-
+                blended so there is no visible seam.

        Returns:
            Path to the cleaned image.
@@ -541,10 +551,19 @@ class WatermarkRemover:

        _total_start = time.monotonic()

-        def _generate() -> Image.Image:
+        def _generate_one(img: Image.Image) -> Image.Image:
            if self.model_profile == "controlnet":
-                return self._run_controlnet(init_image, strength, num_inference_steps, guidance_scale, generator)
-            return self._run_img2img(init_image, strength, num_inference_steps, guidance_scale, generator)
+                return self._run_controlnet(img, strength, num_inference_steps, guidance_scale, generator)
+            return self._run_img2img(img, strength, num_inference_steps, guidance_scale, generator)
+
+        def _generate() -> Image.Image:
+            # Tile only when asked AND the image is larger than one tile; otherwise a
+            # single full-image pass (tiling a sub-tile image is pure overhead).
+            if tile and max(init_image.size) > tile_size:
+                from remove_ai_watermarks.noai.tiling import run_tiled
+
+                return run_tiled(_generate_one, init_image, tile_size, tile_overlap, self._set_progress)
+            return _generate_one(init_image)

        cleaned_image = _generate()

@@ -0,0 +1,140 @@
+"""Unit tests for the sliding-window tiled-diffusion helpers (no GPU/model).
+
+``tiling`` is pure numpy/PIL: the geometry (``plan_tiles`` / ``_axis_positions``),
+the feather window (``feather_weights``), and the blend loop (``run_tiled``) are all
+exercised here with a plain callable standing in for the diffusion pass, so the
+seam-free reconstruction and the tile layout are guarded without loading SDXL.
+"""
+
+from __future__ import annotations
+
+import numpy as np
+import pytest
+from PIL import Image
+
+from remove_ai_watermarks.noai.tiling import (
+    Tile,
+    _axis_positions,
+    feather_weights,
+    plan_tiles,
+    run_tiled,
+)
+
+
+class TestAxisPositions:
+    def test_single_tile_when_length_fits(self):
+        assert _axis_positions(800, 1024, 128) == [0]
+        assert _axis_positions(1024, 1024, 128) == [0]
+
+    def test_two_tiles_last_flush_to_edge(self):
+        # 1500 wide, tile 1024, step 1024-128=896: starts [0], then last = 1500-1024=476.
+        assert _axis_positions(1500, 1024, 128) == [0, 476]
+
+    def test_uniform_step_then_flush(self):
+        # 3000, tile 1024, step 896 -> range yields 0, 896, 1792 (the next start, 2688, exceeds
+        # last = 3000 - 1024 = 1976), so the flush-to-edge start 1976 is appended.
+        assert _axis_positions(3000, 1024, 128) == [0, 896, 1792, 1976]
+
+    def test_no_duplicate_when_range_already_hits_edge(self):
+        # length-tile divisible by step: the last range entry already equals the edge.
+        # 1024*2-128 = 1920 width, step 896, last = 896 -> range gives [0, 896]; no dup.
+        assert _axis_positions(1920, 1024, 128) == [0, 896]
+
+    def test_overlap_at_least_tile_still_progresses(self):
+        # Pathological overlap >= tile is clamped to tile-1 so step stays >= 1.
+        positions = _axis_positions(2000, 1024, 5000)
+        assert positions[0] == 0
+        assert positions[-1] == 2000 - 1024
+        assert positions == sorted(positions)
+
+    def test_invalid_tile_raises(self):
+        with pytest.raises(ValueError, match="tile must be positive"):
+            _axis_positions(100, 0, 10)
+
+
+class TestPlanTiles:
+    def test_single_tile_for_small_image(self):
+        tiles = plan_tiles(800, 600, 1024, 128)
+        assert tiles == [Tile(0, 0, 800, 600)]
+
+    def test_grid_is_row_major_and_uniform_size(self):
+        tiles = plan_tiles(1500, 1500, 1024, 128)
+        # 2x2 grid: xs/ys both [0, 476].
+        assert [(t.x, t.y) for t in tiles] == [(0, 0), (476, 0), (0, 476), (476, 476)]
+        # Every tile is exactly tile_size (uniform -> SDXL-friendly).
+        assert all((t.width, t.height) == (1024, 1024) for t in tiles)
+
+    def test_tiles_cover_the_full_canvas(self):
+        width, height = 2600, 1800
+        tiles = plan_tiles(width, height, 1024, 128)
+        covered = np.zeros((height, width), dtype=bool)
+        for t in tiles:
+            covered[t.y : t.y + t.height, t.x : t.x + t.width] = True
+        assert covered.all()
+
+
+class TestFeatherWeights:
+    def test_shape_matches_tile(self):
+        assert feather_weights(64, 48, 8).shape == (48, 64)
+
+    def test_strictly_positive(self):
+        assert (feather_weights(64, 64, 16) > 0).all()
+
+    def test_interior_higher_than_edge(self):
+        w = feather_weights(64, 64, 16)
+        assert w[32, 32] == pytest.approx(1.0)
+        # The corner sits in the taper, so it is well below the interior.
+        assert w[0, 0] < w[32, 32]
+
+    def test_symmetric(self):
+        w = feather_weights(64, 64, 16)
+        assert np.allclose(w, w[::-1, :])
+        assert np.allclose(w, w[:, ::-1])
+
+    def test_zero_overlap_is_flat(self):
+        # No taper requested -> a flat window of ones.
+        assert np.allclose(feather_weights(32, 32, 0), 1.0)
+
+
+class TestRunTiled:
+    def test_identity_generate_reconstructs_image(self):
+        # A blend of identical (unchanged) tiles must reproduce the input, regardless of
+        # overlap -- the feather weights are a partition-of-unity once normalised. This is the
+        # seam-free guarantee; the <= 1 tolerance only absorbs float32 -> uint8 rounding.
+        rng = np.random.default_rng(0)
+        arr = rng.integers(0, 256, size=(1500, 1300, 3), dtype=np.uint8)
+        image = Image.fromarray(arr)
+
+        out = run_tiled(lambda tile: tile, image, tile_size=512, overlap=64)
+
+        assert out.size == image.size
+        assert np.abs(np.asarray(out, dtype=np.int16) - arr.astype(np.int16)).max() <= 1
+
+    def test_generate_called_once_per_tile(self):
+        calls: list[tuple[int, int]] = []
+
+        def generate(tile: Image.Image) -> Image.Image:
+            calls.append(tile.size)
+            return tile
+
+        image = Image.new("RGB", (1500, 1500), (120, 130, 140))
+        run_tiled(generate, image, tile_size=1024, overlap=128)
+
+        assert len(calls) == len(plan_tiles(1500, 1500, 1024, 128)) == 4
+
+    def test_single_tile_path_for_small_image(self):
+        image = Image.new("RGB", (300, 200), (10, 20, 30))
+        out = run_tiled(lambda tile: tile, image, tile_size=1024, overlap=128)
+        assert out.size == (300, 200)
+        assert np.asarray(out)[0, 0].tolist() == [10, 20, 30]
+
+    def test_mismatched_generate_output_is_resized_back(self):
+        # A pass that rounds dims down to the latent grid returns a smaller tile; run_tiled must
+        # resize it back. tile_size=1020 is not a multiple of 8 (-> 1016), so the branch is hit.
+        def generate(tile: Image.Image) -> Image.Image:
+            w, h = tile.size
+            return tile.resize((w - w % 8, h - h % 8), Image.Resampling.LANCZOS)
+
+        image = Image.new("RGB", (1500, 1100), (200, 100, 50))
+        out = run_tiled(generate, image, tile_size=1020, overlap=128)
+        assert out.size == (1500, 1100)
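The axis cases tested above imply a closed form for how many tiles (and so how many diffusion passes) a given input costs, which can be handy for estimating runtime before enabling `--tile`. This is a sketch under the assumption that the layout follows `_axis_positions` as shown in this commit. `tiles_along_axis` and `diffusion_passes` are hypothetical helper names, not part of the package.

```python
# Hypothetical helpers for estimating tiled-diffusion cost; not part of the package.


def tiles_along_axis(length: int, tile: int, overlap: int) -> int:
    """Number of tile starts along one axis under the flush-to-edge layout."""
    if tile <= 0:
        raise ValueError(f"tile must be positive, got {tile}")
    if length <= tile:
        return 1
    overlap = min(max(overlap, 0), tile - 1)  # same clamp as the planner
    step = tile - overlap
    # ceil((length - tile) / step) regular steps, plus the tile at offset 0.
    return -(-(length - tile) // step) + 1


def diffusion_passes(width: int, height: int, tile_size: int = 1024, overlap: int = 128) -> int:
    """Total per-tile generate calls for a width x height image."""
    return tiles_along_axis(width, tile_size, overlap) * tiles_along_axis(height, tile_size, overlap)


passes_1500 = diffusion_passes(1500, 1500)  # the 2x2 grid used in the tests
passes_4k = diffusion_passes(4096, 3072)    # a 5x4 grid with the default knobs
```

With the defaults, the pass count grows roughly with the image area divided by `(tile_size - overlap) ** 2`, so a larger overlap buys smoother blending at the price of more passes.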