diff --git a/docs/known-limitations.md b/docs/known-limitations.md index 5bf71f7..35fb4b4 100644 --- a/docs/known-limitations.md +++ b/docs/known-limitations.md @@ -132,7 +132,7 @@ See `docs/synthid.md` §5.5 + `docs/controlnet-removal-pipeline-research.md` (ce **Forensic-stealth caveat still applies** (arXiv:2605.09203): defeating the SynthID verifier is not forensic invisibility -- a "this image went through a removal pipeline" classifier can still flag the output. -## `qwen` pipeline (experimental, Qwen-Image 20B, uncertified floors) +## `qwen` pipeline (experimental, Qwen-Image 20B, certified floors) `--pipeline qwen` runs `QwenImageImg2ImgPipeline` on `Qwen/Qwen-Image` (20B MMDiT, Apache-2.0 code AND weights), as an img2img alternative to the SDXL pipelines. Motivation: the controlnet over-regeneration problem above (it plasticizes real photos / loses fine text at the scrub floor). Qwen-Image renders text natively (incl. CJK) and preserves structure markedly better, so at the strength that removes SynthID it damages real content far less. @@ -142,4 +142,8 @@ The scrub still comes from the img2img `strength` (same lever as SDXL); the call **Fidelity vs controlnet was MEASURED, not eyeballed (`scripts/fidelity_metrics.py`, text scored against a vision-transcribed ground truth in `data/qwen_in/ground_truth.json` + PaddleOCR on the variants; an initial eyeball read was wrong and overturned by the metrics).** Methodology rule: only compare fidelity at each pipeline's OWN oracle-confirmed scrub floor -- i.e. between outputs where SynthID is actually removed in BOTH (controlnet OpenAI 0.10 / Gemini 0.15; Qwen OpenAI 0.10 / Gemini 0.25). An equal-strength comparison is invalid where it leaves one pipeline un-scrubbed (Qwen at 0.15 does NOT clear Gemini SynthID, so that run was dropped). At those scrub floors: - **Text:** Qwen wins on substantial Latin/mixed-script text -- OCR CER, controlnet vs Qwen: openai_1 (EN+RU+ZH, both 0.10) 0.385 vs **0.241**, openai_2 (EN, both 0.10) 0.341 vs **0.290**. On a SHORT CJK sign (gemini_1, cnet 0.15 / Qwen 0.25) it is a TIE (0.037 vs 0.037 -- both near-perfect; the earlier Qwen 0.000 was at the higher 0.30, not the certified floor). -- **Faces:** controlnet wins -- gemini_3, 18 faces (cnet 0.15 / Qwen 0.25): ArcFace identity 0.546 vs 0.382, Laplacian-variance retention 0.62 vs 0.40, face LPIPS 0.09 vs 0.17 (Qwen smooths faces MORE; the gap narrows vs Qwen 0.30 but controlnet still wins clearly). At each pipeline's OWN scrub floor the face gap widens (Qwen's Gemini floor 0.30 is 2x controlnet's 0.15). **Conclusion: Qwen is the better TEXT-preserving remover, NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better.** Two caveats before relying on it: (1) near-floor scrub is SEED-NON-DETERMINISTIC (the general known-limitation above), so these single-seed floors are NOT certified — run a seed-repeat sweep before trusting them; (2) `resolve_strength` is shared and pipeline-independent, so the Gemini default (0.15, the certified controlnet floor) UNDER-scrubs Gemini on `qwen` (whose floor is ~0.30) — **pass an explicit `--strength` for Gemini content on `qwen`** until a Qwen-specific ladder is certified. Flat-graphic content was not in the prototype sample. +- **Faces:** controlnet wins -- gemini_3, 18 faces (cnet 0.15 / Qwen 0.25): ArcFace identity 0.546 vs 0.382, Laplacian-variance retention 0.62 vs 0.40, face LPIPS 0.09 vs 0.17 (Qwen smooths faces MORE; the gap narrows vs Qwen 0.30 but controlnet still wins clearly). + +**Conclusion: Qwen is the better TEXT-preserving remover (substantial Latin/mixed text), NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better, so the path is a content-routed lane (text→qwen, faces→controlnet), not a blanket migration.** Caveat: `resolve_strength` is shared and pipeline-independent, so the Gemini default (0.15) UNDER-scrubs Gemini on `qwen` (floor 0.25) — pass `--strength 0.25` for Gemini on `qwen` until a Qwen ladder is wired. Flat-graphic content was not in the sample. + +**Seed as a quality lever (measured, openai_1 at 0.10, seeds 0-4):** the seed barely moves whole-image fidelity (img LPIPS 0.062-0.065, SSIM 0.855-0.857, PSNR 28.5-28.7 — flat) but does shift TEXT legibility (OCR CER 0.241-0.290, ~17% spread) -- the seed changes WHICH details get regenerated, not the overall level. So a per-image best-of-N-seed selection is a WEAK, text-only lever (pick the lowest-CER seed that still scrubs; fidelity selection needs no oracle). Not worth the N× cost for general use -- pin one decent seed in prod; reserve best-of-N for text-heavy premium cases.