docs(roadmap): record next steps for SynthID detector work

Captures the forward plan so a future session can pick it up: the local
pixel detector is blocked pending a generation API or a raw watermarked
dataset (spectral methods shown insufficient); grow the oracle-labeled
corpus; replace synthetic non-PNG C2PA fixtures with real ones; and clear
the maintenance debt (idna bump, strict-pyright cleanup) that keeps
maintain.sh from finishing green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test-user
2026-05-24 11:34:43 -07:00
parent f07ce10c72
commit c006f9b8b4
+4
@@ -258,6 +258,10 @@ pip install certifi
Tracked but not yet implemented:
- **SynthID-Image v2 automated regression test**. The default SDXL profile defeats v2 per manual checks against the [Gemini app](https://support.google.com/gemini/answer/16722517)'s "Verify with SynthID" feature on a Gemini 3 Pro output (May 2026). An automated end-to-end test would need either programmatic access to the [SynthID Detector portal](https://blog.google/innovation-and-ai/products/google-synthid-ai-content-detector/) (waitlist for media professionals and researchers) or an offline surrogate detector. The spectral phase-coherence surrogate from [reverse-SynthID](https://github.com/aloshdenny/reverse-SynthID) was evaluated and does not separate watermarked from cleaned real-content images (it only fires on controlled solid-color references at exact resolution), so it is not a usable oracle. Open.
- **Local SynthID *pixel* detector**. Not feasible today: Google's decoder is proprietary, and magnitude/carrier spectral methods do not separate watermarked from unwatermarked real-content images (confirmed by three independent evaluations, including a from-scratch gpt-image pilot; see CLAUDE.md). Blocked on either (a) a programmatic generation path (OpenAI / Gemini API) to build a per-(model, resolution) labeled corpus at scale, or (b) a raw watermarked-output dataset. If data arrives, the next approach to try is a learned classifier on diverse content rather than a fixed carrier codebook.
- **Grow the SynthID reference corpus** (`data/synthid_corpus/`) with oracle-labeled samples per model and resolution (Gemini app for Google, openai.com/verify for OpenAI). Prerequisite for any pixel-detector attempt and for an automated removal-regression set.
- **Real non-PNG C2PA fixtures**. SynthID-source detection for JPEG / WebP / AVIF is currently covered only by synthetic byte blobs; replace with real vendor-emitted files to ground the binary-scan path.
- **Maintenance debt**. Bump transitive `idna` to 3.15 (blocks `uv-secure`); clear strict-pyright debt in `remove_ai_metadata` / `cli.py` (untyped piexif / PIL / click / rich) so `maintain.sh` can finish green.
- **AVIF / HEIF / JPEG-XL detection limits**. Removal strips top-level C2PA `uuid` and JUMBF `jumb` boxes. EXIF/XMP boxes inside these containers are not yet scrubbed (PNG and JPEG are fully covered).
- **Video pipeline (`noai-video`)**: per-frame inpainting and tracking for Sora 2 dynamic logo, Veo 3.1 badge, Kling, Runway. Separate package, not folded into this repo.
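
For the AVIF / HEIF / JPEG-XL item above, the following is a minimal sketch of what stripping top-level boxes from an ISO BMFF-style container can look like. It is not this repo's implementation: the names `iter_top_level_boxes` and `strip_top_level_boxes` are hypothetical, and dropping every `uuid` box by type alone is a simplifying assumption (a real implementation would likely match the C2PA-specific UUID inside the box).

```python
import struct
from typing import FrozenSet, Iterator, Tuple


def iter_top_level_boxes(data: bytes) -> Iterator[Tuple[bytes, int, int]]:
    """Yield (box_type, start, end) for each top-level box in `data`."""
    pos, n = 0, len(data)
    while pos + 8 <= n:
        # Standard header: 32-bit big-endian size, then a 4-byte type code.
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # size == 1 means a 64-bit "largesize" follows the type field.
            if pos + 16 > n:
                raise ValueError("truncated largesize header")
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:
            # size == 0 means the box runs to the end of the file.
            size = n - pos
        if size < header or pos + size > n:
            raise ValueError(f"invalid box size at offset {pos}")
        yield box_type, pos, pos + size
        pos += size


def strip_top_level_boxes(
    data: bytes,
    drop_types: FrozenSet[bytes] = frozenset({b"uuid", b"jumb"}),
) -> bytes:
    """Return `data` without top-level boxes whose type is in `drop_types`.

    Assumption: filtering by type only. Trailing bytes that do not form a
    complete box header are not copied to the output.
    """
    return b"".join(
        data[start:end]
        for box_type, start, end in iter_top_level_boxes(data)
        if box_type not in drop_types
    )
```

The sketch only touches top-level boxes, which is why EXIF/XMP payloads nested inside other boxes would survive it. It also does not rewrite absolute file offsets stored elsewhere in the container (for example in item location tables), which a real remover has to handle when bytes before the media data are removed.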