Commit Graph

9 Commits

Author SHA1 Message Date
Victor Kuznetsov 8523f48fb6 data(corpus): archive June 2026 SynthID strength-study subjects
Back docs/synthid.md section 2.2 with the actual test set: the per-image
oracle-verified subjects were only in a local working dir, while the doc claimed
they were recorded in data/synthid_corpus/. Ingest the key pos+cleaned pairs so
the claim holds.

- pos: openai_1/2/3 originals (gpt-image, openai-verify) + gemini_1/2/3/4
  originals (Gemini app, gemini-app); all probe as C2PA-SynthID present.
- cleaned: OpenAI at strength 0.05 (openai_2 only s010 captured) + Gemini at 0.15
  --max-resolution 1536; oracle: SynthID NOT detected. Metadata stripped, so no
  C2PA on the cleaned rows.
- Excluded the third-party issue #14 image (pic3): oracle-verified but not
  committed to the public corpus.
- docs/synthid.md 2.2: state OpenAI n=4 = 3 archived + 1 external-only.
- CLAUDE.md: drop the drift-prone "~65 MB" corpus size from the sdist note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 17:09:58 -07:00
test-user 03fb460f77 Track the labeled SynthID corpus; complete metadata-source test coverage
Corpus images were gitignored (local-only). The negatives were reviewed and
cleared for publishing, so the labeled set is now committed (regular git, 65 MB
across 25 files) -- making the removal regression set reproducible and CI-able.

Corpus:
- Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the
  synthetic refs/ calibration fills gitignored.
- Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows
  for removed images pruned; dedup left one cleaned output, f6dd47a5).
- Rewrite the corpus README layout/policy (images committed; review every image
  for private content before adding -- public repo, permanent history).

Test fixtures:
- Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF).
- Add the clean_photo conftest fixture serving a verified-negative image from
  the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it
  (skips if the corpus is absent).

Metadata-source coverage (close the last sub-variant gaps):
- c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and
  compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy).
- exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP
  CreatorTool were already covered).

All 8 metadata-source kinds are now tested at both the unit and identify()
level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:46:47 -07:00
test-user 59d72c5db7 Record verified gpt-image-2 SynthID-cleaned chain in corpus
Add manifest row for the 4ef377bd -> f6dd47a5 chain: a gpt-image-2 sample
(openai.com/verify: SynthID + C2PA detected) cleaned via v0.5.3 `all` at
native 1254x1254 (prod-equivalent SDXL base, strength 0.05, 50 steps).
openai.com/verify reports SynthID NOT detected after the run, re-confirming
that the #10 native-resolution default defeats OpenAI SynthID and resolving
the #15 root cause (older SD-1.5/768px downscale default did not).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 13:55:10 -07:00
test-user 93c664f7fb docs: sync README + corpus map with v0.5.x detection coverage
- README Features: add the identify / provenance-detection capability.
- README Supported models: add FLUX, Stability AI, Microsoft/Bing
  (MAI-Image), Meta AI rows; note SD/SDXL/FLUX imwatermark is locally
  detectable; add a detection note pointing at identify.
- corpus README per-platform map: add Stability / Ideogram / Recraft /
  Krea-FLUX rows + an imwatermark column; correct Bing (MAI-Image,
  signs 'Microsoft'); note imwatermark fires only on pristine pipeline
  output, not re-hosts/exports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 09:32:03 -07:00
test-user ede35a3db5 feat(metadata): read EXIF Make tag; collect Ideogram/Recraft/Krea-FLUX
Collected live samples from three popular generators we lacked:

- Ideogram tags its downloads with EXIF Make="Ideogram AI" (no C2PA, no
  SynthID, no imwatermark) -- the Make tag is its only signal. exif_generator
  only read Software/Artist/ImageDescription, so it missed this; now reads
  Make too. Real cameras put "Apple"/"Canon" in Make (no AI token), so this
  stays low-false-positive. 4 originals ingested.
- Recraft (PNG export) and Krea hosting FLUX 2: downloads carry NO detectable
  signal -- no C2PA/EXIF/IPTC, and notably no imwatermark despite Krea running
  FLUX. identify correctly reports 'unknown'. Both ingested as neg fixtures.

Lesson recorded in CLAUDE.md: the imwatermark detector fires only on pristine
output from a pipeline that runs the encoder (diffusers default, official BFL),
not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 18:38:56 -07:00
test-user 3a1c5427c8 feat(c2pa): recognize Stability AI issuer; fix Microsoft platform label
Collected live C2PA positives from Bing Image Creator and Stability Brand
Studio (DreamStudio successor) and learned two things our scan got wrong:

- Bing now runs Microsoft's own MAI-Image model, not DALL-E, and signs
  C2PA as 'Microsoft'. The scan caught it, but the platform label claimed
  'Microsoft Designer (DALL-E / OpenAI backend)'. Relabeled model-neutral:
  'Microsoft (Bing Image Creator / Designer)'.
- Stability signs C2PA as 'Stability AI' (cert 'Stability AI Ltd'), which
  was not in C2PA_ISSUERS, so it read as 'unknown signer'. Added the issuer
  and a platform mapping. Stability uses no SynthID and (on its current
  Stable Image model) no imwatermark watermark -- verified, both negative.

Both ingested as SynthID-negative corpus fixtures (they are AI but not
SynthID) for issuer-coverage. Canva skipped: its downloads are re-encoded
design exports that strip C2PA, so a Canva sample would be inconclusive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 17:12:42 -07:00
test-user af787fd8d6 docs(corpus): per-platform watermark map + surface-dependent blind spot
Grow the SynthID corpus to 109 originals (91 iPhone-photo negatives,
2 positives) and document what was learned studying 8 platforms:

- README: per-platform watermark map (C2PA issuer / SynthID pixel / IPTC
  / visible sparkle per platform) and an "originals, not previews" note
  (re-encoded previews strip metadata, so a clean preview is not proof).
- CLAUDE.md: surface-dependent blind spot -- the same Google model wraps
  C2PA in the Gemini app but emits the SynthID pixel watermark + sparkle
  with no C2PA/IPTC via the API/playground (AI Studio, Nano Banana), so
  synthid_source returns None despite SynthID being present; only the
  pixel oracle or the visible-sparkle detector catches those.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 15:55:17 -07:00
test-user da0edcbddc chore(corpus): grow SynthID reference set + document autonomous Chrome collection
Adds content positives (OpenAI gpt-image: forest, fisherman, tokyo; Google
gemini: fisherman, mug) and SDXL/non-SynthID negatives to the local corpus
manifest. Now spans 4 resolutions across 2 vendors (was solid-black only).

README: documents driving generation via Chrome MCP -- Gemini single-click
download; ChatGPT via in-page fetch+blob (preserves original C2PA bytes,
unlike the flaky UI download / a canvas re-encode).

Images stay gitignored; only the manifest (sha256 + labels + extracted
metadata) and protocol are tracked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 12:46:46 -07:00
test-user f07ce10c72 feat(metadata): SynthID-source detection, C2PA parser consolidation, corpus + tests
Detect SynthID-bearing images via their C2PA companion: a manifest signed by a
SynthID-using vendor (Google/OpenAI) on AI-generated content implies an
invisible SynthID pixel watermark. Verified end-to-end against the vendor
oracles (openai.com/verify, Gemini "Verify with SynthID").

- metadata: synthid_source() + synthid_watermark verdict in get_ai_metadata,
  surfaced as a `metadata --check` callout. Format-agnostic (PNG caBX parser +
  JPEG/WebP/AVIF/HEIF/JXL binary scan).
- constants: SYNTHID_C2PA_ISSUERS {Google, OpenAI}; +opened/placed actions.
- c2pa: single CBOR-aware parser (_cbor_text_after) replaces glitchy regex
  (fixes fGPT-4o claim_generator); removed duplicate _scan_png_c2pa_chunk from
  metadata; shared synthid_verdict / synthid_vendors_in helpers.
- corpus: scripts/synthid_corpus.py ingest tool + data/synthid_corpus/
  (manifest tracked, images gitignored) for a labeled reference set.
- tests: +38 across C2PA parser internals, extract/inject round-trip, ISOBMFF
  container stripping, all IPTC AI markers, and invisible watermark strength
  tiers (SynthID/StableSignature/TreeRing/StegaStamp/RingID/RivaGAN/...).

Pixel-level SynthID detection remains out of reach locally (Google's decoder is
proprietary); a from-scratch spectral pilot confirmed it does not separate real
content. See CLAUDE.md for the full evaluation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 11:32:46 -07:00