test(c2pa): add real FLUX.2 BFL C2PA fixtures (PNG + JPEG)

flux-1.png / flux-1.jpg are real Black Forest Labs FLUX.2 [pro] Playground outputs (signed C2PA, issuer "Black Forest Labs" + trainedAlgorithmicMedia, manifests verified to contain no personal data). flux-1.jpg is the first committed JPEG-with-C2PA fixture, exercising the c2pa-python non-PNG reader path end to end. Regression tests assert both attribute to "Black Forest Labs (FLUX)".

Also documents the verified finding (n=2, 2026-06-19): BFL's hosted output carries the signed C2PA manifest but NOT the open invisible-watermark DWT-DCT (decodes to degenerate all-ones, chance-level vs the FLUX reference) -- the open pixel mark is dev-inference-code-optional only. So a hosted FLUX.2 image is identified by C2PA alone, with no open-pixel fallback once C2PA is stripped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-05 07:57:50 +02:00 · 2026-06-19 09:37:40 -07:00
parent d4d9429328
commit 9e307d020e
6 changed files with 23 additions and 2 deletions
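
The "chance-level" claim in the commit message can be checked with a little arithmetic. The sketch below is not the project's `invisible_watermark.py`; the helper name `matching_bits` and the constant names are hypothetical, and only the 48-bit FLUX pattern and the threshold of 44 are taken from the documentation changed in this commit. It counts how many bit positions of a degenerate all-ones decode agree with the FLUX reference.

```python
# Illustrative only: names are hypothetical, not the project's API.
# The 48-bit pattern and the threshold (44) are quoted from
# docs/watermarking-landscape.md as changed in this commit.
FLUX_BITS_48 = 0b001010101111111010000111100111001111010100101110
N_BITS = 48
MATCH_THRESHOLD = 44


def matching_bits(decoded: int, reference: int, n_bits: int = N_BITS) -> int:
    """Return how many of the low n_bits positions agree between two payloads."""
    mask = (1 << n_bits) - 1
    differing = (decoded ^ reference) & mask  # 1 wherever the bits disagree
    return n_bits - bin(differing).count("1")


# A degenerate decode: every one of the 48 bits reads as 1.
ALL_ONES = (1 << N_BITS) - 1

score = matching_bits(ALL_ONES, FLUX_BITS_48)
is_flux_mark = score >= MATCH_THRESHOLD

print(score, is_flux_mark)  # 28 False
```

An all-ones decode agrees with the reference exactly where the reference itself has a 1, and the FLUX pattern has 28 ones, so the score of 28 is a property of the pattern rather than evidence of an embedded mark. It sits well below the threshold of 44.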
@@ -36,7 +36,7 @@ Consequences for contributors (do not drift back into the stock niche just becau
 - Run `uv run` from the repo root — from another cwd it falls back to a bare env without numpy/cv2/torch.
 - **Stale `trustmark` remnant in site-packages after an extras change:** the `trustmark` package downloads model weights INTO its own package dir, so when a narrower `uv sync` prunes the package, a `trustmark/models/` directory survives as an empty namespace package. Symptom: pyright `"TrustMark" is unknown import symbol` on `trustmark_detector.py` and `find_spec("trustmark")` returning a loader-less spec (so `is_available()` lies True). Fix: `rm -rf .venv/lib/python3.12/site-packages/trustmark` (regenerable weights cache).
 - To add a dev tool (pytest/ruff/pyright) into the env, use `uv sync --frozen --extra dev --extra gpu`, **never `uv pip install`** — `uv pip install` re-resolves and rewrites `uv.lock`, which silently bumped `transformers` to a build incompatible with the pinned `diffusers` (`cannot import name 'Qwen3VLForConditionalGeneration'`) and broke every `identify`/metadata import. Recovery: `git checkout uv.lock && uv sync --frozen --extra gpu --extra dev`. The `gpu` extra holds `diffusers`/`transformers`/`torch`, so a bare `uv sync` (no extras) removes them; `noai/__init__` is now **lazy** (PEP 562 `__getattr__`, so importing `identify`/`metadata` no longer pulls `watermark_remover`/torch), so a bare env breaks only when the removal pipeline is actually invoked, not on import. `maintain.sh`'s `uv sync --all-extras` also pulls the heavy `trustmark`/`lama` wheels (pytorch-lightning, onnxruntime) — fine on a good connection, but on flaky DNS sync only `--extra gpu --extra dev` and run the lint/test steps by hand.
- Metadata/C2PA tests assert against real committed fixtures in `data/samples/` (`chatgpt-*.png` = OpenAI C2PA, `firefly-1.png` = Adobe, `mj-*` = Midjourney IPTC, `doubao-1.png` = ByteDance Doubao with the China TC260 `<TC260:AIGC>` XMP label **and** a visible "豆包AI生成" text mark bottom-right; `grok-1.jpg` = xAI Grok with its EXIF-only `Signature:` blob + UUID `Artist` and no C2PA/SynthID/IPTC); synthetic byte blobs cover the JPEG/ISOBMFF format paths. The "non-AI / clean photo" control is no longer in `data/samples/` -- the `clean_photo` conftest fixture serves a verified-negative image from the corpus `neg/` set (skips if the corpus is absent).
+- Metadata/C2PA tests assert against real committed fixtures in `data/samples/` (`chatgpt-*.png` = OpenAI C2PA, `firefly-1.png` = Adobe, `mj-*` = Midjourney IPTC, `doubao-1.png` = ByteDance Doubao with the China TC260 `<TC260:AIGC>` XMP label **and** a visible "豆包AI生成" text mark bottom-right; `grok-1.jpg` = xAI Grok with its EXIF-only `Signature:` blob + UUID `Artist` and no C2PA/SynthID/IPTC; `flux-1.png` / `flux-1.jpg` = real Black Forest Labs FLUX.2 Playground output, signed C2PA (issuer "Black Forest Labs" + `trainedAlgorithmicMedia`) -- `flux-1.jpg` is the first committed **JPEG-with-C2PA** fixture, exercising the c2pa-python non-PNG reader path end to end; note BFL's hosted output carries C2PA only, NOT the open DWT-DCT pixel watermark, see `docs/watermarking-landscape.md`); synthetic byte blobs cover the remaining JPEG/ISOBMFF format paths. The "non-AI / clean photo" control is no longer in `data/samples/` -- the `clean_photo` conftest fixture serves a verified-negative image from the corpus `neg/` set (skips if the corpus is absent).
 - SynthID reference corpus: `scripts/synthid_corpus.py` ingests labeled images into `data/synthid_corpus/`. The labeled `images/` (`pos/` `neg/` `cleaned/`) are **committed** (public repo -- review every image for private content before adding; `manifest.csv` is kept in sync with the files on disk, one row per tracked image); only the synthetic `refs/` calibration fills are gitignored. See its README for the collection protocol and verification oracles. **`cleaned/` examples must be produced by a CURRENT shipped removal method** -- the default SDXL img2img pass (optionally `--max-resolution`). Do NOT archive cleaned outputs from methods that are no longer in the pipeline (ctrlregen, the old text/face-protection, IP-Adapter FaceID, CodeFormer) or from the experimental opt-in paths (controlnet, face restore) as corpus examples; a cleaned reference should represent the canonical removal, and a removed method's output is not a reproducible example. Keep those experiment outputs in a local working dir, never in the committed corpus.

 ## Configuration
@@ -5,7 +5,7 @@
 > no content was changed or summarized.

 Who embeds what, and whether it is locally detectable (so we know which gaps are fillable). See `identify.py` for what we read.
- **Locally detectable (open decoder, no key/API):** Stable Diffusion / SDXL / FLUX via `imwatermark` DWT-DCT (now covered by `invisible_watermark.py`). FLUX uses the same library (`black-forest-labs/flux2` `src/flux2/watermark.py`, 48-bit `0b001010101111111010000111100111001111010100101110`); SDXL is the diffusers `WATERMARK_MESSAGE` (`0b101100111110110010010000011110111011000110011110`). Caveat: fragile to re-encoding.
+- **Locally detectable (open decoder, no key/API):** Stable Diffusion / SDXL / FLUX via `imwatermark` DWT-DCT (now covered by `invisible_watermark.py`). FLUX uses the same library (`black-forest-labs/flux2` `src/flux2/watermark.py`, 48-bit `0b001010101111111010000111100111001111010100101110`); SDXL is the diffusers `WATERMARK_MESSAGE` (`0b101100111110110010010000011110111011000110011110`). Caveat: fragile to re-encoding. **The FLUX open DWT-DCT is OPTIONAL dev-inference-code only and is NOT applied by Black Forest Labs' hosted surface** -- verified 2026-06-19 on 2 real FLUX.2 BFL Playground outputs (PNG + JPG): both carry the signed C2PA manifest (issuer "Black Forest Labs") but the DWT-DCT decode returns the degenerate all-ones (28 and 27 of 48 bits match the FLUX reference, i.e. chance level, against a match threshold of 44), so the open pixel watermark is absent. Practical consequence: a hosted FLUX.2 image is identified by C2PA only; once C2PA is stripped there is NO open-pixel fallback for it (the 48-bit pattern in `_BITS_48` is correct and would fire only on a locally-generated FLUX.2 with the watermark flag explicitly enabled).
 - **C2PA / IPTC (covered by the issuer/marker scan):** OpenAI, Google, Adobe Firefly, Microsoft (Designer + **Bing Image Creator** — collected 2026-05-24; Bing now runs Microsoft's own **MAI-Image** model, signs C2PA as "Microsoft", NOT OpenAI/DALL-E), **Stability AI** (collected from Brand Studio / DreamStudio successor; signs C2PA as "Stability AI Ltd", no SynthID, no imwatermark on its current Stable Image model — issuer added to `C2PA_ISSUERS`), and **Canva** (Magic Media signs C2PA as "Canva" + `trainedAlgorithmicMedia` with a generic `c2pa-rs` claim generator, no SynthID — issuer `b"Canva"` → "Canva (Magic Media)"; found on real production traffic 2026-06-19, which **disproved the earlier assumption** that Canva downloads are re-encoded exports that always strip C2PA). Still unsampled: Getty, Shutterstock. Midjourney embeds NO C2PA and no invisible watermark (our `mj-*` sample carried only the IPTC tag).

 **Samsung Galaxy AI** (Generative Edit / Sketch to Image / Portrait Studio on Galaxy S23 FE / S24 / S25, One UI 7+) signs C2PA as "Samsung Galaxy" with the standard `trainedAlgorithmicMedia` source type AND a proprietary `genAIType` marker; verified on real signed files 2026-05-29 (the standard scan catches the source type; `genAIType` additionally catches a Galaxy S24 file that omits it). It ALSO burns a **visible** localized wordmark into the pixels — a sparkle + "generated with AI" string in the bottom-LEFT corner (issue #37; the Italian "✦ Contenuti generati dall'AI" variant is calibrated) — removed by `samsung_engine.py` / `visible --mark samsung` (reverse-alpha, see the engine bullet); detection feeds `identify` as the medium `visible_samsung` signal. The string is locale-specific, so each locale needs its own captured alpha template.
@@ -229,6 +229,19 @@ class TestIdentifyRealSamples:
        assert r.is_ai_generated is True
        assert any("IPTC" in w for w in r.watermarks)

+    def test_flux_bfl_c2pa_png(self):
+        # flux-1.png: real Black Forest Labs FLUX.2 Playground output (signed C2PA).
+        r = identify(SAMPLES_DIR / "flux-1.png", check_visible=False)
+        assert r.is_ai_generated is True
+        assert r.platform == "Black Forest Labs (FLUX)"
+
+    def test_flux_bfl_c2pa_jpeg_via_reader(self):
+        # flux-1.jpg: same source as a JPEG -- the real committed JPEG-with-C2PA
+        # fixture that exercises the c2pa-python non-PNG reader path end to end.
+        r = identify(SAMPLES_DIR / "flux-1.jpg", check_visible=False)
+        assert r.is_ai_generated is True
+        assert r.platform == "Black Forest Labs (FLUX)"
+
    def test_clean_photo_is_unknown_not_clean(self, clean_photo: Path):
        r = identify(clean_photo, check_visible=False)
        assert r.is_ai_generated is None  # never asserted False
@@ -182,6 +182,14 @@ class TestC2PARealSamples:
        assert has_c2pa_metadata(out)
        assert "OpenAI" in extract_c2pa_info(out)["issuer"]

+    def test_extract_info_flux_jpeg_via_reader(self):
+        """Real committed JPEG-with-C2PA fixture: the non-PNG reader path works."""
+        info = extract_c2pa_info(SAMPLES_DIR / "flux-1.jpg")
+        assert info["has_c2pa"] is True
+        assert info["c2pa_manifest"].startswith("C2PA manifest store")  # reader, not chunk
+        assert "Black Forest Labs" in info["issuer"]
+        assert "trainedAlgorithmicMedia" in info["source_type"]
+
    def test_extract_info_uses_reader_store(self):
        """The c2pa-python reader path: structured (not heuristic) extraction."""
        from remove_ai_watermarks.noai import c2pa