docs: capture the Qwen-improvement research (ship vs improve)

Cited deep-research report (22 sources, 3-vote adversarial verification, 5 refuted) behind the "ship qwen as-is or improve first?" decision. Verdict: shippable now as an opt-in text lane; strongest improvement lead is adding a Qwen-Image ControlNet (InstantX / DiffSynth, Apache-2.0, diffusers QwenImageControlNetPipeline) for face/ skin structure; Z-Image-Turbo (6B, Apache-2.0) is the best cheaper text-preserving substitute. No improvement has measured face-fidelity at our scrub floors yet -- validate with scripts/fidelity_metrics.py first. Linked from known-limitations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-04 23:47:49 +02:00 · 2026-06-20 15:58:46 -07:00
parent 0d9033d63a
commit 8f64869bfc
2 changed files with 155 additions and 0 deletions
@@ -146,4 +146,6 @@ The scrub still comes from the img2img `strength` (same lever as SDXL); the call

 **Conclusion: Qwen is the better TEXT-preserving remover (substantial Latin/mixed text), NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better, so the path is a content-routed lane (text→qwen, faces→controlnet), not a blanket migration.** Caveat: `resolve_strength` is shared and pipeline-independent, so the Gemini default (0.15) UNDER-scrubs Gemini on `qwen` (floor 0.25) — pass `--strength 0.25` for Gemini on `qwen` until a Qwen ladder is wired. Flat-graphic content was not in the sample.

+**Improving Qwen (ship vs improve):** the cited research on fixing the face-smoothing while keeping the text win (Qwen-Image ControlNet for structure conditioning, Qwen-Image-Edit, Z-Image-Turbo as a cheaper text-preserving substitute, non-regenerative detail restoration) lives in `docs/qwen-improvement-research.md` -- read it before extending the `qwen` pipeline. Verdict: shippable now as an opt-in text lane; the strongest improvement lead is adding a Qwen-Image ControlNet, but no improvement has measured face-fidelity at our floors yet (validate with `scripts/fidelity_metrics.py` first).
+
 **Seed as a quality lever (measured, openai_1 at 0.10, seeds 0-4):** the seed barely moves whole-image fidelity (img LPIPS 0.062-0.065, SSIM 0.855-0.857, PSNR 28.5-28.7 — flat) but does shift TEXT legibility (OCR CER 0.241-0.290, ~17% spread) -- the seed changes WHICH details get regenerated, not the overall level. So a per-image best-of-N-seed selection is a WEAK, text-only lever (pick the lowest-CER seed that still scrubs; fidelity selection needs no oracle). Not worth the N× cost for general use -- pin one decent seed in prod; reserve best-of-N for text-heavy premium cases.
@@ -0,0 +1,153 @@
+# Qwen-Image improvement research (2026-06-20)
+
+Cited research behind the decision **"ship the `qwen` pipeline as-is, or improve it
+first?"** Produced by the multi-source deep-research harness (5 search angles, 22
+sources fetched, 85 claims extracted, 25 verified by a 3-vote adversarial check, 20
+confirmed / 5 killed, 104 agent calls). Findings carry their confidence and vote.
+
+## Context
+
+The `qwen` pipeline runs base Qwen-Image (20B MMDiT, Apache-2.0) as a low-strength
+img2img scrub (removal comes from the denoising `strength`). Certified oracle scrub
+floors: OpenAI 0.10 (seed-robust), Gemini 0.25 (pinned seed). Measured against the
+SDXL + canny-ControlNet pipeline (`scripts/fidelity_metrics.py`): Qwen preserves
+**text** markedly better (incl. CJK and Cyrillic, lower OCR CER) but preserves
+**faces** worse, smoothing skin (Laplacian-variance retention 0.40 vs 0.62, face
+LPIPS 0.17 vs 0.09, ArcFace identity 0.38 vs 0.55 at the scrub floors). The goal of
+the research: keep Qwen's text advantage while fixing the face-smoothing, and judge
+production-readiness.
+
+## Verdict
+
+Base Qwen-Image is **shippable now as an opt-in text-content lane** (Apache-2.0 on
+code and weights, scrub lever confirmed), but it is not a universal upgrade (it loses
+faces). The strongest verified improvement path is to **add structure conditioning**
+(a Qwen-Image ControlNet) to the existing base pass, the direct analog of the SDXL +
+canny conditioning that wins on faces. Separately, **Z-Image / Z-Image-Turbo** (6B,
+Apache-2.0) is the best-verified lighter alternative to evaluate before committing to
+the 20B cost. None of the improvements has measured face-fidelity numbers at our
+scrub floors yet, so each must be validated with `scripts/fidelity_metrics.py` plus
+the oracle before shipping.
+
+## Findings
+
+1. **[high, 3-0] A permissively-licensed Qwen-Image ControlNet exists today and is
+   CUDA/diffusers-runnable.** InstantX Qwen-Image-ControlNet-Union supports
+   canny/soft-edge/depth/pose; DiffSynth-Studio maintains blockwise Canny/Depth/Inpaint
+   plus an In-Context-Control-Union; diffusers exposes `QwenImageControlNetPipeline`
+   and `QwenImageMultiControlNetModel` with `controlnet_conditioning_scale` (default
+   1.0) and `control_guidance_start`/`end`. This is the direct analog of the certified
+   SDXL+canny structure conditioning that wins on faces. Caveat: canny/depth preserve
+   geometric structure, not face identity per se, and none is a **tile**-ControlNet
+   (the variant most tied to fine-detail/skin retention in the SDXL world).
+   Sources: InstantX/Qwen-Image-ControlNet-Union, InstantX/Qwen-Image-ControlNet-Inpainting,
+   DiffSynth-Studio Qwen-Image docs, diffusers qwenimage pipeline docs.
+
+2. **[high, 3-0] The scrub mechanism is preserved, and the license is clean.**
+   `QwenImageImg2ImgPipeline.strength` (default 0.6, range 0-1; DiffSynth names it
+   `denoising_strength`) keeps the partial-regeneration scrub the project relies on,
+   lower values staying closer to the input. Qwen-Image and Qwen-Image-Edit-2509 are
+   Apache-2.0 on both code and weights.
+
+3. **[medium, mixed 2-1 / 3-0] Qwen-Image-Edit improves identity consistency, but that
+   is not proof it fixes our metric.** The instruction-edit pipeline (2511 better than
+   2509) improves identity/character consistency, but only for identity *through edits*
+   of an input portrait, which is not the same as measured face-skin Laplacian/LPIPS
+   fidelity at a low scrub strength. Architecture: 20B base + Qwen2.5-VL (semantic
+   control) + VAE Encoder (appearance control). Several stronger edit-model face claims
+   were refuted (see below).
+
+4. **[high, 3-0] Z-Image / Z-Image-Turbo is the best-verified lighter alternative.** A
+   6B model (~1/3 of Qwen-Image's 20B), Apache-2.0 on code and weights, strong bilingual
+   (Chinese + English) native text rendering, with an official diffusers
+   `ZImageImg2ImgPipeline` exposing the same 0-1 denoising-strength scrub lever; Turbo
+   runs at ~8 steps (guidance_scale=0.0) vs ~40. A material cost/footprint reduction vs
+   20B/A100-80GB (but see caveat 4 on the refuted consumer-GPU claim).
+
+5. **[high, 3-0] EliGen-V2 is NOT relevant** to the face-smoothing problem. It is an
+   entity-level/regional control model (LoRA + regional attention placing entities via
+   text + mask maps, plus entity-level inpainting); it provides no
+   ControlNet/canny/depth/tile structure conditioning or face-skin-detail retention.
+
+6. **[medium, 2-1] flymy-ai/qwen-image-realism-lora** is Apache-2.0 (code+weights) on
+   base Qwen-Image, so it is permissively usable with the existing base img2img pass,
+   but it is NOT verified to specifically fix the face/skin-smoothing failure mode.
+
+## Caveats
+
+1. The research did NOT surface verified evidence for two things specifically asked:
+   (a) a Qwen-Image **tile**-ControlNet (the variant most tied to fine-detail/skin
+   retention; only canny/soft-edge/depth/pose/inpaint were confirmed), and (b) any
+   **non-regenerative detail-restoration** technique (high-frequency residual transfer,
+   guided filtering) that recovers smoothed faces without re-introducing the watermark.
+   Research angle 4 produced zero surviving claims, so it is unanswered.
+2. No claim provides measured face-fidelity numbers (ArcFace/LPIPS/Laplacian) for ANY
+   recommended intervention at the project's scrub floors. All fidelity evidence is the
+   project's own internal measurement. The improvements are mechanistically sound but
+   unproven for this exact metric, so validate with `scripts/fidelity_metrics.py`
+   before shipping.
+3. Several vendor model cards are marketing-register primary sources (Qwen blog,
+   Z-Image card). Load-bearing facts (license, params, API levers) are independently
+   corroborated, but comparative quality framings are author glosses.
+4. Z-Image's "sub-second" figure is H800-specific and author-benchmarked; consumer-GPU
+   third-party benchmarks are still limited (seconds, not sub-second, though within the
+   <16GB envelope).
+5. Time-sensitivity: Qwen-Image-Edit-2511 and Z-Image are late-2025/2026 releases; the
+   diffusers pipelines cited are on the main/dev branch, so confirm released-version
+   availability before pinning.
+6. Five claims were refuted (below), clustering on over-strong edit-model face-fidelity
+   and one over-strong Z-Image cost claim.
+
+## Open questions
+
+- Does a Qwen-Image **tile-ControlNet** (or equivalent high-resolution detail
+  conditioning) exist under a permissive license?
+- What **non-regenerative detail-restoration** method recovers smoothed faces WITHOUT
+  re-introducing SynthID? Note: residual transfer from the ORIGINAL risks copying back
+  watermark-carrying high frequencies, so it must be verified against the SynthID oracle.
+- Does adding Qwen-Image-ControlNet (canny/depth) at the certified floors (OpenAI 0.10,
+  Gemini 0.25) actually raise face Laplacian/LPIPS toward the SDXL+ControlNet numbers
+  (0.62 / 0.09) WITHOUT re-introducing SynthID, or does the structure constraint
+  preserve the watermark the way ControlNet can on photoreal content (the existing
+  "SynthID CAN survive controlnet at low strength" caveat)?
+- Head-to-head: does Z-Image-Turbo at its scrub floor match Qwen's text advantage
+  (CJK+Cyrillic CER) while not worsening faces, and what are Z-Image's own SynthID
+  scrub floors and seed-robustness (none exist yet)?
+
+## Refuted claims (do NOT rely on these)
+
+- [0-3] "Qwen-Image-Edit-2511 specifically targets/mitigates image drift, the same
+  failure mode as face-detail loss in a low-strength scrub." (qwen.ai/blog, 2511)
+- [0-3] "Qwen-Image-Edit-2509 explicitly improves facial identity preservation and
+  supports portrait styles and pose transformations." (HF Qwen-Image-Edit-2509)
+- [0-3] "Qwen-Image-Edit-2509 has native built-in ControlNet support (depth/edge/
+  keypoint)." (HF Qwen-Image-Edit-2509)
+- [1-2] "flymy realism LoRA specifically targets facial and skin detail, the exact
+  failure mode." (HF flymy-ai/qwen-image-realism-lora)
+- [0-3] "Z-Image-Turbo runs on consumer 16GB-VRAM hardware, far below the A100-80GB of
+  Qwen-Image 20B, materially lowering per-image cost." (HF Tongyi-MAI/Z-Image-Turbo)
+
+## Sources
+
+1. https://qwen.ai/blog?id=qwen-image-edit-2511
+2. https://qwenlm.github.io/blog/qwen-image-edit/
+3. https://docs.comfy.org/tutorials/image/qwen/qwen-image-edit
+4. https://github.com/FurkanGozukara/Stable-Diffusion/wiki/Qwen-Image-Edit-2511-Free-and-Open-Source-Crushes-Qwen-Image-Edit-2509-and-Challenges-Nano-Banana-Pro
+5. https://myaiforce.com/qie-2511/
+6. https://huggingface.co/Qwen/Qwen-Image-Edit-2509
+7. https://huggingface.co/InstantX/Qwen-Image-ControlNet-Union
+8. https://huggingface.co/InstantX/Qwen-Image-ControlNet-Inpainting
+9. https://huggingface.co/DiffSynth-Studio/Qwen-Image-EliGen-V2
+10. https://github.com/modelscope/DiffSynth-Studio/blob/main/docs/en/Model_Details/Qwen-Image.md
+11. https://blog.comfy.org/p/day-1-support-of-qwen-image-instantx
+12. https://learn.thinkdiffusion.com/how-to-use-qwen-image-with-instantx-union-controlnet-in-comfyui-guide-workflow/
+13. https://huggingface.co/flymy-ai/qwen-image-realism-lora
+14. https://huggingface.co/lightx2v/Qwen-Image-Lightning/discussions/4
+15. https://huggingface.co/docs/diffusers/main/en/api/pipelines/qwenimage
+16. https://www.diyphotography.net/skin-retouching-technique-frequency-separation/
+17. https://link.springer.com/content/pdf/10.1007/978-3-642-15549-9_1.pdf
+18. https://github.com/ShieldMnt/invisible-watermark/wiki/Frequency-Methods
+19. https://huggingface.co/Tongyi-MAI/Z-Image-Turbo
+20. https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/z_image/pipeline_z_image_img2img.py
+21. https://arxiv.org/pdf/2511.22699
+22. https://github.com/ModelTC/LightX2V-Qwen-Image-Lightning