docs: certify qwen scrub floors (OpenAI 0.10 seed-robust, Gemini 0.25)

Oracle seed-repeat + floor refinement (2026-06-20, data/qwen_in):
- OpenAI floor 0.10 is SEED-ROBUST: 0.05 and 0.075 still detected; 0.10 clean on
  seeds 0-4 (5/5) -> a random seed is safe.
- Gemini floor lowered 0.30 -> 0.25 (0.20 still detected, 0.25 clean on both
  images). Single-seed (seed 0): the Gemini oracle rate-limits volume seed-repeat,
  so pin a seed in prod rather than relying on seed-robustness there.

Re-measured fidelity at the certified floors (controlnet 0.15 vs Qwen 0.25 for
Gemini): faces still favor controlnet (ArcFace 0.546 vs 0.382, lapvar 0.62 vs
0.40); the short-CJK text case is now a TIE (gemini_1 0.037 vs 0.037 -- the earlier
Qwen 0.000 was at 0.30, not the floor). Qwen's text win holds on substantial
Latin/mixed text (OpenAI 0.385 vs 0.241 / 0.341 vs 0.290). Update watermark_profiles
comment, CLAUDE.md, module-internals, known-limitations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Victor Kuznetsov
2026-06-20 15:16:51 -07:00
parent 373b910a60
commit 7dddfef14e
4 changed files with 13 additions and 9 deletions
+1 -1
View File
@@ -181,7 +181,7 @@ Root cause: bad alpha (under-estimated, max ~0.65) + fixed-no-inpaint + tight bo
**`sdxl`** (renamed from `default` 2026-06-09; `default` kept as a back-compat alias via `normalize_profile`) runs plain SDXL img2img (`_run_img2img`); it is the lighter opt-down alternative (no ControlNet weights).
**`qwen`** (`_run_qwen`, `_load_qwen_pipeline`) runs `QwenImageImg2ImgPipeline` on `Qwen/Qwen-Image` (20B MMDiT, Apache-2.0 code AND weights). The scrub still comes from the img2img `strength`; Qwen's value is **text preservation** (incl. CJK and small text). **Metric-measured nuance (2026-06-19, `scripts/fidelity_metrics.py`, do NOT trust the eyeball here — it misled). Compare ONLY at each pipeline's oracle-confirmed scrub floor (outputs where SynthID is removed in BOTH — an equal-strength compare is invalid where it leaves one un-scrubbed; Qwen at 0.15 does not clear Gemini): Qwen wins TEXT (lower OCR CER across EN/RU/ZH, perfect Chinese) but controlnet wins FACES (higher Laplacian-variance retention and lower LPIPS — Qwen smooths faces MORE; ArcFace identity favors controlnet 0.546 vs 0.331 at the Gemini floors).** So Qwen is the better text-preserving remover, NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better. Specifics: bf16 on CUDA (fp16 risks overflow on the 20B MMDiT — see the dtype branch in `__init__`); loads `QWEN_MODEL_ID` unless `--model` is overridden; the call shape lives in the pure module helper `_build_qwen_kwargs` (unit-tested without torch in `tests/test_platform.py::TestQwenKwargs`), which uses Qwen's `true_cfg_scale` (NOT SDXL's `guidance_scale` — the CLI `--guidance-scale` maps onto it; ~4.0 is typical, the SDXL default 7.5 is high for Qwen) and an explicit `negative_prompt` (`_QWEN_PROMPT`/`_QWEN_NEGATIVE`). It is CUDA/cloud-class (the 20B does not fit MPS), so `_run_qwen` has NO MPS->CPU fallback — an error propagates. `_load_qwen_pipeline` raises a clear ImportError if the installed diffusers lacks `QwenImageImg2ImgPipeline`. **Prototype oracle floors (Modal A100-80GB, single seed, 2026-06-19, PENDING seed-repeat cert): OpenAI clears at strength ~0.10, Gemini at ~0.30 (0.20 still detected) — both controls were SynthID-positive. The Gemini floor (0.30) is HIGHER than the certified controlnet Gemini floor (0.15), and `resolve_strength` is shared/pipeline-independent, so pass an explicit `--strength` for Gemini content on `qwen` until a Qwen-specific ladder is certified.** Fidelity vs controlnet was measured (`scripts/fidelity_metrics.py`), NOT eyeballed — see the metric nuance above: Qwen wins text, controlnet wins faces.
**`qwen`** (`_run_qwen`, `_load_qwen_pipeline`) runs `QwenImageImg2ImgPipeline` on `Qwen/Qwen-Image` (20B MMDiT, Apache-2.0 code AND weights). The scrub still comes from the img2img `strength`; Qwen's value is **text preservation** (incl. CJK and small text). **Metric-measured nuance (2026-06-19, `scripts/fidelity_metrics.py`, do NOT trust the eyeball here — it misled). Compare ONLY at each pipeline's oracle-confirmed scrub floor (outputs where SynthID is removed in BOTH — an equal-strength compare is invalid where it leaves one un-scrubbed; Qwen at 0.15 does not clear Gemini): Qwen wins TEXT (lower OCR CER across EN/RU/ZH, perfect Chinese) but controlnet wins FACES (higher Laplacian-variance retention and lower LPIPS — Qwen smooths faces MORE; ArcFace identity favors controlnet 0.546 vs 0.331 at the Gemini floors).** So Qwen is the better text-preserving remover, NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better. Specifics: bf16 on CUDA (fp16 risks overflow on the 20B MMDiT — see the dtype branch in `__init__`); loads `QWEN_MODEL_ID` unless `--model` is overridden; the call shape lives in the pure module helper `_build_qwen_kwargs` (unit-tested without torch in `tests/test_platform.py::TestQwenKwargs`), which uses Qwen's `true_cfg_scale` (NOT SDXL's `guidance_scale` — the CLI `--guidance-scale` maps onto it; ~4.0 is typical, the SDXL default 7.5 is high for Qwen) and an explicit `negative_prompt` (`_QWEN_PROMPT`/`_QWEN_NEGATIVE`). It is CUDA/cloud-class (the 20B does not fit MPS), so `_run_qwen` has NO MPS->CPU fallback — an error propagates. `_load_qwen_pipeline` raises a clear ImportError if the installed diffusers lacks `QwenImageImg2ImgPipeline`. **CERTIFIED oracle floors (Modal A100-80GB, 2026-06-20): OpenAI 0.10 (seed-robust — clean on seeds 0-4), Gemini 0.25 (seed 0 verified on 2 images; the Gemini oracle rate-limits volume seed-repeat, so PIN a seed in prod). The Gemini floor (0.25) is HIGHER than the certified controlnet Gemini floor (0.15), and `resolve_strength` is shared/pipeline-independent, so pass an explicit `--strength 0.25` for Gemini content on `qwen` until a Qwen-specific ladder is wired into `resolve_strength`.** Fidelity vs controlnet was measured at the certified floors (`scripts/fidelity_metrics.py`), NOT eyeballed — see the metric nuance above: Qwen wins substantial text, controlnet wins faces.
**`controlnet`** (**the DEFAULT pipeline since 2026-06-09** for `invisible`/`all`/`batch` and both engine ctors; `_run_controlnet`, `_load_controlnet_pipeline`) runs `StableDiffusionXLControlNetImg2ImgPipeline` with the SDXL-native canny ControlNet `xinsir/controlnet-canny-sdxl-1.0` (`watermark_profiles.CONTROLNET_CANNY_MODEL`): the control image is `cv2.Canny(gray, 100, 200)` stacked to 3 channels (`_CANNY_LOW`/`_CANNY_HIGH`, prompt `_CONTROLNET_PROMPT` / `_CONTROLNET_NEGATIVE`).