docs: certify qwen scrub floors (OpenAI 0.10 seed-robust, Gemini 0.25)

Oracle seed-repeat + floor refinement (2026-06-20, data/qwen_in):
- OpenAI floor 0.10 is SEED-ROBUST: 0.05 and 0.075 still detected; 0.10 clean on seeds 0-4 (5/5) -> a random seed is safe.
- Gemini floor lowered 0.30 -> 0.25 (0.20 still detected, 0.25 clean on both images). Single-seed (seed 0): the Gemini oracle rate-limits volume seed-repeat, so pin a seed in prod rather than relying on seed-robustness there.

Re-measured fidelity at the certified floors (controlnet 0.15 vs Qwen 0.25 for Gemini): faces still favor controlnet (ArcFace 0.546 vs 0.382, lapvar 0.62 vs 0.40); the short-CJK text case is now a TIE (gemini_1 0.037 vs 0.037 -- the earlier Qwen 0.000 was at 0.30, not the floor). Qwen's text win holds on substantial Latin/mixed text (OpenAI 0.385 vs 0.241 / 0.341 vs 0.290).

Update watermark_profiles comment, CLAUDE.md, module-internals, known-limitations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-05 07:57:50 +02:00 · 2026-06-20 15:16:51 -07:00
parent 373b910a60
commit 7dddfef14e
4 changed files with 13 additions and 9 deletions
@@ -138,6 +138,8 @@ See `docs/synthid.md` §5.5 + `docs/controlnet-removal-pipeline-research.md` (ce

 The scrub still comes from the img2img `strength` (same lever as SDXL); the call shape lives in the pure `_build_qwen_kwargs` (uses Qwen's `true_cfg_scale`, not SDXL's `guidance_scale` — the CLI `--guidance-scale` maps onto it, and ~4.0 is typical vs the SDXL default 7.5). bf16 on CUDA. It is **CUDA/cloud-class — the 20B does not fit MPS — so `_run_qwen` has NO MPS→CPU fallback** (unlike the SDXL paths). Cost on Modal A100-80GB is ~$0.05-0.10/image vs SDXL.
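
Review note: a minimal sketch of the call-shape idea in the paragraph above, under stated assumptions. `build_qwen_kwargs_sketch` and the two prompt constants are hypothetical stand-ins; the real `_build_qwen_kwargs` signature and the `_QWEN_PROMPT`/`_QWEN_NEGATIVE` values are not shown in this diff. The only point illustrated is that the CLI guidance value is forwarded as `true_cfg_scale` and `guidance_scale` is never set.

```python
from typing import Any, Dict

# Placeholder prompts (assumption): the real _QWEN_PROMPT / _QWEN_NEGATIVE
# strings are not visible in this diff.
SKETCH_PROMPT = "a faithful, detailed reproduction of the input image"
SKETCH_NEGATIVE = "blurry, distorted text, artifacts"


def build_qwen_kwargs_sketch(
    image: Any,
    strength: float,
    guidance_scale: float = 4.0,  # CLI --guidance-scale; ~4.0 is typical for Qwen
    steps: int = 50,              # hypothetical default, not taken from the source
) -> Dict[str, Any]:
    """Build img2img call kwargs without importing torch or diffusers."""
    if not 0.0 < strength <= 1.0:
        raise ValueError(f"strength must be in (0, 1], got {strength}")
    return {
        "image": image,
        "prompt": SKETCH_PROMPT,
        "negative_prompt": SKETCH_NEGATIVE,
        "strength": strength,
        # Qwen's knob is true_cfg_scale; SDXL's guidance_scale is not passed.
        "true_cfg_scale": guidance_scale,
        "num_inference_steps": steps,
    }
```

Keeping the helper pure (plain dict in, plain dict out) is what makes it testable without a GPU, which matches the unit-testing approach the docs describe.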

-**Prototype oracle floors (Modal A100-80GB, single seed, 2026-06-19 — PENDING seed-repeat cert):** on native-resolution OpenAI and Gemini cert inputs (both controls SynthID-POSITIVE), OpenAI cleared at strength **0.10** and Gemini at **0.30** (0.20 still detected).
+**Certified oracle floors (Modal A100-80GB, 2026-06-20):** on native-resolution OpenAI and Gemini cert inputs (`data/qwen_in/`, both controls SynthID-POSITIVE): **OpenAI 0.10** (0.05 and 0.075 still detected; 0.10 clean and SEED-ROBUST — clean on seeds 0-4, so a random seed is safe) and **Gemini 0.25** (0.20 still detected, 0.25 clean on both images; lowered from the 0.30 first measured). Gemini seed-repeat is single-seed (seed 0): the Gemini oracle rate-limits volume, so PIN a seed in production rather than relying on seed-robustness there.

-**Fidelity vs controlnet was MEASURED, not eyeballed (`scripts/fidelity_metrics.py`, text scored against a vision-transcribed ground truth in `data/qwen_in/ground_truth.json` + PaddleOCR on the variants; an initial eyeball read was wrong and overturned by the metrics).** Methodology rule: only compare fidelity at each pipeline's OWN oracle-confirmed scrub floor -- i.e. between outputs where SynthID is actually removed in BOTH (controlnet OpenAI 0.10 / Gemini 0.15; Qwen OpenAI 0.10 / Gemini 0.30). An equal-strength comparison is invalid where it leaves one pipeline un-scrubbed (Qwen at 0.15 does NOT clear Gemini SynthID, so that run was dropped). At those scrub floors: Qwen wins **text** across EN/RU/ZH -- OCR CER, controlnet vs Qwen: openai_1 (both 0.10) 0.385 vs 0.241, openai_2 (both 0.10) 0.341 vs 0.290, gemini_1 (cnet 0.15 / Qwen 0.30) 0.037 vs **0.000** (perfect Chinese even at the higher strength). controlnet wins **faces** -- gemini_3, 18 faces (cnet 0.15 / Qwen 0.30): ArcFace identity 0.546 vs 0.331, Laplacian-variance retention 0.62 vs 0.40, face LPIPS 0.09 vs 0.19 (Qwen smooths faces MORE). At each pipeline's OWN scrub floor the face gap widens (Qwen's Gemini floor 0.30 is 2x controlnet's 0.15). **Conclusion: Qwen is the better TEXT-preserving remover, NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better.** Two caveats before relying on it: (1) near-floor scrub is SEED-NON-DETERMINISTIC (the general known-limitation above), so these single-seed floors are NOT certified — run a seed-repeat sweep before trusting them; (2) `resolve_strength` is shared and pipeline-independent, so the Gemini default (0.15, the certified controlnet floor) UNDER-scrubs Gemini on `qwen` (whose floor is ~0.30) — **pass an explicit `--strength` for Gemini content on `qwen`** until a Qwen-specific ladder is certified. Flat-graphic content was not in the prototype sample.
+**Fidelity vs controlnet was MEASURED, not eyeballed (`scripts/fidelity_metrics.py`, text scored against a vision-transcribed ground truth in `data/qwen_in/ground_truth.json` + PaddleOCR on the variants; an initial eyeball read was wrong and overturned by the metrics).** Methodology rule: only compare fidelity at each pipeline's OWN oracle-confirmed scrub floor -- i.e. between outputs where SynthID is actually removed in BOTH (controlnet OpenAI 0.10 / Gemini 0.15; Qwen OpenAI 0.10 / Gemini 0.25). An equal-strength comparison is invalid where it leaves one pipeline un-scrubbed (Qwen at 0.15 does NOT clear Gemini SynthID, so that run was dropped). At those scrub floors:
+- **Text:** Qwen wins on substantial Latin/mixed-script text -- OCR CER, controlnet vs Qwen: openai_1 (EN+RU+ZH, both 0.10) 0.385 vs **0.241**, openai_2 (EN, both 0.10) 0.341 vs **0.290**. On a SHORT CJK sign (gemini_1, cnet 0.15 / Qwen 0.25) it is a TIE (0.037 vs 0.037 -- both near-perfect; the earlier Qwen 0.000 was at the higher 0.30, not the certified floor).
+- **Faces:** controlnet wins -- gemini_3, 18 faces (cnet 0.15 / Qwen 0.25): ArcFace identity 0.546 vs 0.382, Laplacian-variance retention 0.62 vs 0.40, face LPIPS 0.09 vs 0.17 (Qwen smooths faces MORE; the gap narrows vs Qwen 0.30 but controlnet still wins clearly). At each pipeline's OWN scrub floor the face gap widens (Qwen's Gemini floor 0.25 is ~1.7x controlnet's 0.15). **Conclusion: Qwen is the better TEXT-preserving remover, NOT a universal fidelity win -- controlnet's canny edge map holds face skin detail better.** Two caveats before relying on it: (1) near-floor scrub is SEED-NON-DETERMINISTIC (the general known-limitation above): the OpenAI floor passed seed-repeat (clean on seeds 0-4), but the Gemini floor is verified at seed 0 only, so PIN a seed for Gemini in production; (2) `resolve_strength` is shared and pipeline-independent, so the Gemini default (0.15, the certified controlnet floor) UNDER-scrubs Gemini on `qwen` (whose floor is 0.25) -- **pass an explicit `--strength 0.25` for Gemini content on `qwen`** until a Qwen-specific ladder is wired into `resolve_strength`. Flat-graphic content was not in the prototype sample.
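
Review note: for readers unfamiliar with the text metric, here is a minimal sketch of character error rate (CER) as it is commonly defined: Levenshtein edit distance between the OCR output and the ground truth, divided by the ground-truth length. This is an illustration only; whether `scripts/fidelity_metrics.py` applies extra normalization (whitespace, case, punctuation) is not shown in this diff, and the function names are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Lower is better, and because the function iterates over Python `str` code points, CJK characters each count as one unit. On a short sign a single wrong character moves the score a lot, which is one reason the short-CJK case is a weak discriminator.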
@@ -181,7 +181,7 @@ Root cause: bad alpha (under-estimated, max ~0.65) + fixed-no-inpaint + tight bo

 **`sdxl`** (renamed from `default` 2026-06-09; `default` kept as a back-compat alias via `normalize_profile`) runs plain SDXL img2img (`_run_img2img`); it is the lighter opt-down alternative (no ControlNet weights).

-**`qwen`** (`_run_qwen`, `_load_qwen_pipeline`) runs `QwenImageImg2ImgPipeline` on `Qwen/Qwen-Image` (20B MMDiT, Apache-2.0 code AND weights). The scrub still comes from the img2img `strength`; Qwen's value is **text preservation** (incl. CJK and small text). **Metric-measured nuance (2026-06-19, `scripts/fidelity_metrics.py`, do NOT trust the eyeball here — it misled). Compare ONLY at each pipeline's oracle-confirmed scrub floor (outputs where SynthID is removed in BOTH — an equal-strength compare is invalid where it leaves one un-scrubbed; Qwen at 0.15 does not clear Gemini): Qwen wins TEXT (lower OCR CER across EN/RU/ZH, perfect Chinese) but controlnet wins FACES (higher Laplacian-variance retention and lower LPIPS — Qwen smooths faces MORE; ArcFace identity favors controlnet 0.546 vs 0.331 at the Gemini floors).** So Qwen is the better text-preserving remover, NOT a universal fidelity win — controlnet's canny edge map holds face skin detail better. Specifics: bf16 on CUDA (fp16 risks overflow on the 20B MMDiT — see the dtype branch in `__init__`); loads `QWEN_MODEL_ID` unless `--model` is overridden; the call shape lives in the pure module helper `_build_qwen_kwargs` (unit-tested without torch in `tests/test_platform.py::TestQwenKwargs`), which uses Qwen's `true_cfg_scale` (NOT SDXL's `guidance_scale` — the CLI `--guidance-scale` maps onto it; ~4.0 is typical, the SDXL default 7.5 is high for Qwen) and an explicit `negative_prompt` (`_QWEN_PROMPT`/`_QWEN_NEGATIVE`). It is CUDA/cloud-class (the 20B does not fit MPS), so `_run_qwen` has NO MPS->CPU fallback — an error propagates. `_load_qwen_pipeline` raises a clear ImportError if the installed diffusers lacks `QwenImageImg2ImgPipeline`. **Prototype oracle floors (Modal A100-80GB, single seed, 2026-06-19, PENDING seed-repeat cert): OpenAI clears at strength ~0.10, Gemini at ~0.30 (0.20 still detected) — both controls were SynthID-positive. 
The Gemini floor (0.30) is HIGHER than the certified controlnet Gemini floor (0.15), and `resolve_strength` is shared/pipeline-independent, so pass an explicit `--strength` for Gemini content on `qwen` until a Qwen-specific ladder is certified.** Fidelity vs controlnet was measured (`scripts/fidelity_metrics.py`), NOT eyeballed — see the metric nuance above: Qwen wins text, controlnet wins faces.
+**`qwen`** (`_run_qwen`, `_load_qwen_pipeline`) runs `QwenImageImg2ImgPipeline` on `Qwen/Qwen-Image` (20B MMDiT, Apache-2.0 code AND weights). The scrub still comes from the img2img `strength`; Qwen's value is **text preservation** (incl. CJK and small text). **Metric-measured nuance (re-measured 2026-06-20 at the certified floors, `scripts/fidelity_metrics.py`, do NOT trust the eyeball here -- it misled). Compare ONLY at each pipeline's oracle-confirmed scrub floor (outputs where SynthID is removed in BOTH -- an equal-strength compare is invalid where it leaves one un-scrubbed; Qwen at 0.15 does not clear Gemini): Qwen wins TEXT on substantial Latin/mixed-script content (lower OCR CER; a short CJK sign is a TIE at 0.037 vs 0.037) but controlnet wins FACES (higher Laplacian-variance retention and lower LPIPS -- Qwen smooths faces MORE; ArcFace identity favors controlnet 0.546 vs 0.382 at the Gemini floors).** So Qwen is the better text-preserving remover, NOT a universal fidelity win -- controlnet's canny edge map holds face skin detail better. Specifics: bf16 on CUDA (fp16 risks overflow on the 20B MMDiT -- see the dtype branch in `__init__`); loads `QWEN_MODEL_ID` unless `--model` is overridden; the call shape lives in the pure module helper `_build_qwen_kwargs` (unit-tested without torch in `tests/test_platform.py::TestQwenKwargs`), which uses Qwen's `true_cfg_scale` (NOT SDXL's `guidance_scale` -- the CLI `--guidance-scale` maps onto it; ~4.0 is typical, the SDXL default 7.5 is high for Qwen) and an explicit `negative_prompt` (`_QWEN_PROMPT`/`_QWEN_NEGATIVE`). It is CUDA/cloud-class (the 20B does not fit MPS), so `_run_qwen` has NO MPS->CPU fallback -- an error propagates. `_load_qwen_pipeline` raises a clear ImportError if the installed diffusers lacks `QwenImageImg2ImgPipeline`. **CERTIFIED oracle floors (Modal A100-80GB, 2026-06-20): OpenAI 0.10 (seed-robust -- clean on seeds 0-4), Gemini 0.25 (seed 0 verified on 2 images; the Gemini oracle rate-limits volume seed-repeat, so PIN a seed in prod). 
The Gemini floor (0.25) is HIGHER than the certified controlnet Gemini floor (0.15), and `resolve_strength` is shared/pipeline-independent, so pass an explicit `--strength 0.25` for Gemini content on `qwen` until a Qwen-specific ladder is wired into `resolve_strength`.** Fidelity vs controlnet was measured at the certified floors (`scripts/fidelity_metrics.py`), NOT eyeballed -- see the metric nuance above: Qwen wins substantial text, controlnet wins faces.
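
Review note: until a Qwen-specific ladder exists, a pipeline-aware lookup could look roughly like the sketch below. Everything here is hypothetical (the table name, the function name, and the `"openai"`/`"gemini"` keys); the real `resolve_strength` signature is not shown in this diff. The floor values are the certified ones quoted in this commit.

```python
from typing import Dict, Optional

# Certified scrub floors quoted in this commit, keyed by pipeline then source.
# The table layout and key names are assumptions made for illustration.
SCRUB_FLOORS: Dict[str, Dict[str, float]] = {
    "controlnet": {"openai": 0.10, "gemini": 0.15},
    "qwen": {"openai": 0.10, "gemini": 0.25},
}


def strength_for(pipeline: str, source: str, explicit: Optional[float] = None) -> float:
    """Return the explicit --strength if given, else the pipeline's own floor."""
    if explicit is not None:
        return explicit
    try:
        return SCRUB_FLOORS[pipeline][source]
    except KeyError:
        raise ValueError(
            f"no certified floor for pipeline={pipeline!r}, source={source!r}"
        ) from None
```

With a table like this, `strength_for("qwen", "gemini")` would return 0.25 rather than the shared 0.15 default, which is the under-scrub case the paragraph above warns about. Failing loudly on an unknown pair avoids silently falling back to another pipeline's floor.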

 **`controlnet`** (**the DEFAULT pipeline since 2026-06-09** for `invisible`/`all`/`batch` and both engine ctors; `_run_controlnet`, `_load_controlnet_pipeline`) runs `StableDiffusionXLControlNetImg2ImgPipeline` with the SDXL-native canny ControlNet `xinsir/controlnet-canny-sdxl-1.0` (`watermark_profiles.CONTROLNET_CANNY_MODEL`): the control image is `cv2.Canny(gray, 100, 200)` stacked to 3 channels (`_CANNY_LOW`/`_CANNY_HIGH`, prompt `_CONTROLNET_PROMPT` / `_CONTROLNET_NEGATIVE`).