diff --git a/docs/known-limitations.md b/docs/known-limitations.md index 35fb4b4..095e958 100644 --- a/docs/known-limitations.md +++ b/docs/known-limitations.md @@ -146,4 +146,6 @@ The scrub still comes from the img2img `strength` (same lever as SDXL); the call **Conclusion: Qwen is the better TEXT-preserving remover (substantial Latin/mixed text), NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better, so the path is a content-routed lane (text→qwen, faces→controlnet), not a blanket migration.** Caveat: `resolve_strength` is shared and pipeline-independent, so the Gemini default (0.15) UNDER-scrubs Gemini on `qwen` (floor 0.25) — pass `--strength 0.25` for Gemini on `qwen` until a Qwen ladder is wired. Flat-graphic content was not in the sample. +**Improving Qwen (ship vs improve):** the cited research on fixing the face-smoothing while keeping the text win (Qwen-Image ControlNet for structure conditioning, Qwen-Image-Edit, Z-Image-Turbo as a cheaper text-preserving substitute, non-regenerative detail restoration) lives in `docs/qwen-improvement-research.md` -- read it before extending the `qwen` pipeline. Verdict: shippable now as an opt-in text lane; the strongest improvement lead is adding a Qwen-Image ControlNet, but no improvement has measured face-fidelity at our floors yet (validate with `scripts/fidelity_metrics.py` first). + **Seed as a quality lever (measured, openai_1 at 0.10, seeds 0-4):** the seed barely moves whole-image fidelity (img LPIPS 0.062-0.065, SSIM 0.855-0.857, PSNR 28.5-28.7 — flat) but does shift TEXT legibility (OCR CER 0.241-0.290, ~17% spread) -- the seed changes WHICH details get regenerated, not the overall level. So a per-image best-of-N-seed selection is a WEAK, text-only lever (pick the lowest-CER seed that still scrubs; fidelity selection needs no oracle). Not worth the N× cost for general use -- pin one decent seed in prod; reserve best-of-N for text-heavy premium cases. diff --git a/docs/qwen-improvement-research.md b/docs/qwen-improvement-research.md new file mode 100644 index 0000000..b75688e --- /dev/null +++ b/docs/qwen-improvement-research.md @@ -0,0 +1,153 @@ +# Qwen-Image improvement research (2026-06-20) + +Cited research behind the decision **"ship the `qwen` pipeline as-is, or improve it +first?"** Produced by the multi-source deep-research harness (5 search angles, 22 +sources fetched, 85 claims extracted, 25 verified by a 3-vote adversarial check, 20 +confirmed / 5 killed, 104 agent calls). Findings carry their confidence and vote. + +## Context + +The `qwen` pipeline runs base Qwen-Image (20B MMDiT, Apache-2.0) as a low-strength +img2img scrub (removal comes from the denoising `strength`). Certified oracle scrub +floors: OpenAI 0.10 (seed-robust), Gemini 0.25 (pinned seed). Measured against the +SDXL + canny-ControlNet pipeline (`scripts/fidelity_metrics.py`): Qwen preserves +**text** markedly better (incl. CJK and Cyrillic, lower OCR CER) but preserves +**faces** worse, smoothing skin (Laplacian-variance retention 0.40 vs 0.62, face +LPIPS 0.17 vs 0.09, ArcFace identity 0.38 vs 0.55 at the scrub floors). The goal of +the research: keep Qwen's text advantage while fixing the face-smoothing, and judge +production-readiness. + +## Verdict + +Base Qwen-Image is **shippable now as an opt-in text-content lane** (Apache-2.0 on +code and weights, scrub lever confirmed), but it is not a universal upgrade (it loses +faces). The strongest verified improvement path is to **add structure conditioning** +(a Qwen-Image ControlNet) to the existing base pass, the direct analog of the SDXL + +canny conditioning that wins on faces. Separately, **Z-Image / Z-Image-Turbo** (6B, +Apache-2.0) is the best-verified lighter alternative to evaluate before committing to +the 20B cost. None of the improvements has measured face-fidelity numbers at our +scrub floors yet, so each must be validated with `scripts/fidelity_metrics.py` plus +the oracle before shipping. + +## Findings + +1. **[high, 3-0] A permissively-licensed Qwen-Image ControlNet exists today and is + CUDA/diffusers-runnable.** InstantX Qwen-Image-ControlNet-Union supports + canny/soft-edge/depth/pose; DiffSynth-Studio maintains blockwise Canny/Depth/Inpaint + plus an In-Context-Control-Union; diffusers exposes `QwenImageControlNetPipeline` + and `QwenImageMultiControlNetModel` with `controlnet_conditioning_scale` (default + 1.0) and `control_guidance_start`/`end`. This is the direct analog of the certified + SDXL+canny structure conditioning that wins on faces. Caveat: canny/depth preserve + geometric structure, not face identity per se, and none is a **tile**-ControlNet + (the variant most tied to fine-detail/skin retention in the SDXL world). + Sources: InstantX/Qwen-Image-ControlNet-Union, InstantX/Qwen-Image-ControlNet-Inpainting, + DiffSynth-Studio Qwen-Image docs, diffusers qwenimage pipeline docs. + +2. **[high, 3-0] The scrub mechanism is preserved, and the license is clean.** + `QwenImageImg2ImgPipeline.strength` (default 0.6, range 0-1; DiffSynth names it + `denoising_strength`) keeps the partial-regeneration scrub the project relies on, + lower values staying closer to the input. Qwen-Image and Qwen-Image-Edit-2509 are + Apache-2.0 on both code and weights. + +3. **[medium, mixed 2-1 / 3-0] Qwen-Image-Edit improves identity consistency, but that + is not proof it fixes our metric.** The instruction-edit pipeline (2511 better than + 2509) improves identity/character consistency, but only for identity *through edits* + of an input portrait, which is not the same as measured face-skin Laplacian/LPIPS + fidelity at a low scrub strength. Architecture: 20B base + Qwen2.5-VL (semantic + control) + VAE Encoder (appearance control). Several stronger edit-model face claims + were refuted (see below). + +4. **[high, 3-0] Z-Image / Z-Image-Turbo is the best-verified lighter alternative.** A + 6B model (~1/3 of Qwen-Image's 20B), Apache-2.0 on code and weights, strong bilingual + (Chinese + English) native text rendering, with an official diffusers + `ZImageImg2ImgPipeline` exposing the same 0-1 denoising-strength scrub lever; Turbo + runs at ~8 steps (guidance_scale=0.0) vs ~40. A material cost/footprint reduction vs + 20B/A100-80GB (but see caveat 4 on the refuted consumer-GPU claim). + +5. **[high, 3-0] EliGen-V2 is NOT relevant** to the face-smoothing problem. It is an + entity-level/regional control model (LoRA + regional attention placing entities via + text + mask maps, plus entity-level inpainting); it provides no + ControlNet/canny/depth/tile structure conditioning or face-skin-detail retention. + +6. **[medium, 2-1] flymy-ai/qwen-image-realism-lora** is Apache-2.0 (code+weights) on + base Qwen-Image, so it is permissively usable with the existing base img2img pass, + but it is NOT verified to specifically fix the face/skin-smoothing failure mode. + +## Caveats + +1. The research did NOT surface verified evidence for two things specifically asked: + (a) a Qwen-Image **tile**-ControlNet (the variant most tied to fine-detail/skin + retention; only canny/soft-edge/depth/pose/inpaint were confirmed), and (b) any + **non-regenerative detail-restoration** technique (high-frequency residual transfer, + guided filtering) that recovers smoothed faces without re-introducing the watermark. + Research angle 4 produced zero surviving claims, so it is unanswered. +2. No claim provides measured face-fidelity numbers (ArcFace/LPIPS/Laplacian) for ANY + recommended intervention at the project's scrub floors. All fidelity evidence is the + project's own internal measurement. The improvements are mechanistically sound but + unproven for this exact metric, so validate with `scripts/fidelity_metrics.py` + before shipping. +3. Several vendor model cards are marketing-register primary sources (Qwen blog, + Z-Image card). Load-bearing facts (license, params, API levers) are independently + corroborated, but comparative quality framings are author glosses. +4. Z-Image's "sub-second" figure is H800-specific and author-benchmarked; consumer-GPU + third-party benchmarks are still limited (seconds, not sub-second, though within the + <16GB envelope). +5. Time-sensitivity: Qwen-Image-Edit-2511 and Z-Image are late-2025/2026 releases; the + diffusers pipelines cited are on the main/dev branch, so confirm released-version + availability before pinning. +6. Five claims were refuted (below), clustering on over-strong edit-model face-fidelity + and one over-strong Z-Image cost claim. + +## Open questions + +- Does a Qwen-Image **tile-ControlNet** (or equivalent high-resolution detail + conditioning) exist under a permissive license? +- What **non-regenerative detail-restoration** method recovers smoothed faces WITHOUT + re-introducing SynthID? Note: residual transfer from the ORIGINAL risks copying back + watermark-carrying high frequencies, so it must be verified against the SynthID oracle. +- Does adding Qwen-Image-ControlNet (canny/depth) at the certified floors (OpenAI 0.10, + Gemini 0.25) actually raise face Laplacian/LPIPS toward the SDXL+ControlNet numbers + (0.62 / 0.09) WITHOUT re-introducing SynthID, or does the structure constraint + preserve the watermark the way ControlNet can on photoreal content (the existing + "SynthID CAN survive controlnet at low strength" caveat)? +- Head-to-head: does Z-Image-Turbo at its scrub floor match Qwen's text advantage + (CJK+Cyrillic CER) while not worsening faces, and what are Z-Image's own SynthID + scrub floors and seed-robustness (none exist yet)? + +## Refuted claims (do NOT rely on these) + +- [0-3] "Qwen-Image-Edit-2511 specifically targets/mitigates image drift, the same + failure mode as face-detail loss in a low-strength scrub." (qwen.ai/blog, 2511) +- [0-3] "Qwen-Image-Edit-2509 explicitly improves facial identity preservation and + supports portrait styles and pose transformations." (HF Qwen-Image-Edit-2509) +- [0-3] "Qwen-Image-Edit-2509 has native built-in ControlNet support (depth/edge/ + keypoint)." (HF Qwen-Image-Edit-2509) +- [1-2] "flymy realism LoRA specifically targets facial and skin detail, the exact + failure mode." (HF flymy-ai/qwen-image-realism-lora) +- [0-3] "Z-Image-Turbo runs on consumer 16GB-VRAM hardware, far below the A100-80GB of + Qwen-Image 20B, materially lowering per-image cost." (HF Tongyi-MAI/Z-Image-Turbo) + +## Sources + +1. https://qwen.ai/blog?id=qwen-image-edit-2511 +2. https://qwenlm.github.io/blog/qwen-image-edit/ +3. https://docs.comfy.org/tutorials/image/qwen/qwen-image-edit +4. https://github.com/FurkanGozukara/Stable-Diffusion/wiki/Qwen-Image-Edit-2511-Free-and-Open-Source-Crushes-Qwen-Image-Edit-2509-and-Challenges-Nano-Banana-Pro +5. https://myaiforce.com/qie-2511/ +6. https://huggingface.co/Qwen/Qwen-Image-Edit-2509 +7. https://huggingface.co/InstantX/Qwen-Image-ControlNet-Union +8. https://huggingface.co/InstantX/Qwen-Image-ControlNet-Inpainting +9. https://huggingface.co/DiffSynth-Studio/Qwen-Image-EliGen-V2 +10. https://github.com/modelscope/DiffSynth-Studio/blob/main/docs/en/Model_Details/Qwen-Image.md +11. https://blog.comfy.org/p/day-1-support-of-qwen-image-instantx +12. https://learn.thinkdiffusion.com/how-to-use-qwen-image-with-instantx-union-controlnet-in-comfyui-guide-workflow/ +13. https://huggingface.co/flymy-ai/qwen-image-realism-lora +14. https://huggingface.co/lightx2v/Qwen-Image-Lightning/discussions/4 +15. https://huggingface.co/docs/diffusers/main/en/api/pipelines/qwenimage +16. https://www.diyphotography.net/skin-retouching-technique-frequency-separation/ +17. https://link.springer.com/content/pdf/10.1007/978-3-642-15549-9_1.pdf +18. https://github.com/ShieldMnt/invisible-watermark/wiki/Frequency-Methods +19. https://huggingface.co/Tongyi-MAI/Z-Image-Turbo +20. https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/z_image/pipeline_z_image_img2img.py +21. https://arxiv.org/pdf/2511.22699 +22. https://github.com/ModelTC/LightX2V-Qwen-Image-Lightning