20 Commits

Author SHA1 Message Date
Victor Kuznetsov df0fafe94e fix(identify): stop flagging multi-actor C2PA manifests as integrity clashes
The C2PA issuer attribution (`c2pa`) and the SynthID proxy (`synthid`) are
derived from the same manifest, so treating them as independent signals made
rule 1 fire on legitimate multi-actor manifests where a product wraps another
vendor's engine (Microsoft Designer on OpenAI, Microsoft on Google) or an edit
chain re-signs (Adobe over a Gemini original). 19 such files in the
2026-06-01/02 spaces batches read as "likely spoofed/laundered" before this.

Group `c2pa` + `synthid` into one provenance source via `_CLASH_SOURCE`; rule 1
now requires two vendors from different sources. A manifest vendor still clashes
with a genuinely independent stamp (EXIF/XMP generator, IPTC AISystemUsed, AIGC,
xAI).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 19:02:35 -07:00
Victor Kuznetsov e501bec9ff feat(identify): detect visible Doubao/Jimeng marks; keep identify import torch-free
identify previously ran only the Gemini sparkle as a visible detector, so a
Doubao/Jimeng image with stripped TC260 metadata had no visible fallback. Add
`_visible_text_marks` (registry-backed) so the ByteDance Doubao 豆包AI生成 and
Jimeng 即梦AI marks are detected too, each gated by its own engine NCC threshold
via MarkDetection.detected. New signals `visible_doubao` / `visible_jimeng`
(medium), same stripped-metadata fallback role as the sparkle; excluded from
integrity-clash vendor claims; set platform only when no harder signal did.

Also make `noai/__init__` lazy (PEP 562 __getattr__): importing the light
`noai.c2pa` / `noai.constants` submodules (which identify needs) no longer
eagerly pulls `watermark_remover`, which imports torch + diffusers at module
top. `import remove_ai_watermarks.identify` drops from ~420 MB to ~21 MB in a
full gpu/detect install (torch not loaded), so it fits a 512 MB host; the
removal API resolves lazily on first access. Guarded by TestIdentifyImportIsLight.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 20:43:52 -07:00
Victor Kuznetsov e572767555 feat(visible): add Jimeng remover, fix Doubao outline defect, reproducible mask build
Visible-watermark work across all three corner-mark engines plus a committed,
reproducible alpha-build pipeline (scripts/visible_alpha_solve.py) fed by committed
solid black/gray/white captures.

- jimeng: new "即梦AI" wordmark remover (reverse-alpha + thin residual inpaint,
  always NCC-aligned -- the mark re-rasterizes/jitters per image). Detect via glyph
  silhouette NCC (0.45 threshold; does not cross-fire with Doubao). Registered in the
  visible-mark catalog; `visible --mark jimeng` / `--mark auto`.
- doubao: fix a real production defect -- the shipped remover left a READABLE
  "豆包AI生成" outline on real samples while detect() returned conf 0.0 (fooled by a
  thin outline), so the test passed and the "56/56 clean" claim was detector-measured,
  not visual. Root cause: under-estimated alpha + fixed-geometry-no-inpaint + tight
  locate box. Rebuilt alpha (careful gray-self solve), always-align, thin inpaint,
  widened locate box -> readable outline becomes faint texture-level traces.
- gemini: rebuild gemini_bg_{96,48} from our own controlled captures (validated NCC
  0.9998 vs the prior third-party asset); removal re-verified clean, no behaviour change.
- tests: add textured-shift regression to both engines (guards the align-on-shift path
  the Doubao defect exposed; lesson: a detector-only removal test is insufficient,
  assert visual residual).
- docs: CLAUDE.md, README, capture READMEs and docstrings synced; stale
  "exact/pixel-exact/56-clean" claims removed.

Also includes a SynthID label-wording clarification in identify.py/cli.py
("SynthID pixel watermark" -> "SynthID watermark, inferred from C2PA metadata").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 12:20:19 -07:00
Victor Kuznetsov 5d0e6c3a65 fix: harden metadata parsers and engines; sync docs (full-repo review)
Apply fixes from a full-repo review (code, tests, docs).

Security / correctness:
- Clamp attacker-controlled PNG/caBX chunk lengths to the remaining file
  size in metadata.py and noai/c2pa.py (a malformed length no longer drives
  a multi-GB read); skipped chunks seek instead of read.
- noai/isobmff.strip_c2pa_boxes is now fail-safe on a malformed box: return
  the original bytes with a warning instead of silently truncating the tail,
  so metadata --remove can no longer emit a corrupt file.
- doubao_engine._fixed_alpha_map clamps the glyph box to the image (no crash
  on degenerate width-vs-height).
- watermark_remover._run_region_hires gates the phaseCorrelate offset on
  response and magnitude (a spurious shift no longer garbles text) and drops
  the generator after a CPU fallback (no MPS/CPU device mismatch).

Robustness:
- gemini_engine, doubao_engine, region_eraser normalize grayscale and RGBA
  inputs to BGR at the engine entry points.
- image_io.imwrite returns False on an unwritable path (matches cv2).
- invisible_engine guards a None imread result before use.
- trustmark_detector._decoder uses a double-checked threading lock.
- ctrlregen.tiling.tile_positions raises on overlap >= tile.
- humanizer chromatic shift no longer wraps opposite-edge pixels.
- identify OpenAI caveat keyed on the normalized vendor, not a substring.
- Remove the dead "visible --detect-threshold" CLI option.
- publish.yml verifies the release tag matches the package version.

Docs:
- README strength 0.05 to 0.10; .env.example HF_TOKEN marked optional;
  doubao_capture README updated to reverse-alpha-only; CLAUDE.md synced with
  the new behaviors and the batch command.

Tests: new test_security_clamp.py for the read clamp and isobmff fail-safe;
erase CLI coverage; integrity-clash rule 2 end-to-end; multi-tag EXIF
survival and cross-format strip guards; channel/size, tiling, humanizer, and
imwrite regressions. Full suite 493 passed, 2 skipped; ruff and pyright src/
clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 18:00:39 -07:00
Victor Kuznetsov 58bdf51c59 Visible-watermark registry: reverse-alpha-only Doubao + Gemini, exact native recovery (#28)
* fix(trustmark): gate detection on re-encode durability to kill false positives

TrustMark's wm_present flag is a BCH validity check that spuriously
validates on a content-correlated fraction of un-watermarked images
(AI textures trip it more than camera photos). On a 1343-image set all
20 raw detections were false, several on Gemini/OpenAI/Doubao output that
cannot carry Adobe's watermark, with random-bytes secrets.

A genuine TrustMark is a durable soft binding that survives re-encoding,
so detect_trustmark now re-decodes after a mild JPEG round-trip and
requires the same schema both times. Every observed false positive
collapsed under this gate; the second decode runs only on the rare hit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(identify): Samsung Galaxy AI, FLUX, ByteDance C2PA; fix C2PA substring FP

Detection extensions verified on real signed files (2026-05-29):

- Samsung Galaxy AI: signer attribution via a new _SIGNER_C2PA_PLATFORM
  (Samsung Galaxy / ASUS Gallery) kept separate from the capture-camera
  _DEVICE_C2PA_PLATFORM so a Galaxy AI edit (device cert + AI source type)
  does not trip the camera-vs-AI integrity clash. Plus metadata.samsung_genai:
  the proprietary genAIType marker in PhotoEditor_Re_Edit_Data, a medium-
  confidence AI-editing signal (samsung_only branch).
- Black Forest Labs (FLUX) and ByteDance Volcano Engine (Doubao/Jimeng)
  added as C2PA issuers + issuer->platform mappings.
- fix: C2PA presence required only the bare 4-byte 'c2pa' substring, which
  false-positives on compressed pixel data (a recompressed PNG IDAT re-flagged
  C2PA after its manifest was correctly stripped). New c2pa_marker_in() requires
  the JUMBF wrapper (jumb+c2pa) or the C2PA uuid box; applied in identify +
  metadata. Verified: all 535 real C2PA files carry jumb.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(doubao): gate detection on text structure to cut ~95% of false positives (#23)

Coverage alone over-fired: any textured bottom-right corner cleared the
threshold, so the detector false-positived on ~28% of arbitrary images.
The real '豆包AI生成' mark is six glyphs in one row, so detect now also
requires the text-structure signature (_glyph_structure): many connected
components, no single dominant blob, concentration in a thin horizontal
band. False positives dropped 343 -> 17 across the corpus while keeping
real-mark recall and the doubao-1.png sample. Also accept a no-op force
kwarg for remover-interface symmetry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(samsung): add Samsung Galaxy AI visible-badge remover

New samsung_engine.py removes the bottom-left sparkle + localized
'AI-generated content' badge that Galaxy AI tools stamp. Mirrors the
Doubao locate->mask->inpaint pattern but bottom-left, with a dual-polarity
top-hat mask (the badge is light-on-dark or dark-on-light). Detection gates
on a band + left-anchor signature (the Doubao CJK-component gate does not
transfer: Latin badge letters connect into few blobs). Explicit-only --
tuned on few real badges with a ~4% FP floor, so it is not used in auto.
Synthetic byte-blob fixtures (real badges are user content, not shipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(visible): unified known-watermark registry + LaMa inpaint backend

watermark_registry.py is a single catalog of known visible marks, each
tying {usual location, in_auto flag, recovery strategy, detect adapter,
remove adapter}: gemini (reverse-alpha, exact), doubao, samsung. cmd_visible
is now registry-driven (best_auto_mark for --mark auto; mark_keys() feeds the
CLI choices) -- the per-mark _run_doubao/_run_samsung helper branches are gone.

Cross-engine confidences are not comparable, so the gemini adapter applies the
corpus-validated 0.5 sparkle threshold for auto arbitration (its engine flag is
loose and weakly fired ~0.36 on Doubao text, hijacking auto).

--backend auto|cv2|lama chooses background reconstruction for the mask-based
marks; auto = LaMa when onnxruntime is present, else cv2. For LaMa the mask is
the FILLED glyph bounding box (sparse glyph masks leave anti-aliased edges
behind). cv2 stays the zero-dependency fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: watermark registry, Samsung/FLUX/ByteDance detection, LaMa backend, trustmark gate

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(doubao): exact reverse-alpha removal from captured alpha map

The Doubao '豆包AI生成' mark is a fixed semi-transparent white overlay, so
given its alpha map the original pixels are recovered exactly:
original = (wm - a*logo)/(1-a) -- no inpaint hallucination.

The alpha map + logo colour were solved from real black+gray Doubao captures
on a controlled background: on black captured = a*logo, and the black/gray pair
solves a per-pixel without assuming the logo colour (a_max~0.65, logo near-white);
the white capture cross-validates (mark vanishes to a flat fill). Bundled as
assets/doubao_alpha.png + geometry constants.

remove_watermark_reverse_alpha applies it scaled to image width; exact at the
captured width, so the registry routes doubao through it only when
reverse_alpha_available (width within the calibrated band) and the mark is
detected, falling back to mask inpaint (cv2/LaMa) otherwise. A light residual
inpaint cleans the sub-pixel rescaling error. Add captures at more resolutions
to widen exact coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(visible): reverse-alpha only -- drop inpaint removal + heuristic detection

Per the principle that we only remove/detect what we can do exactly, the
visible-mark path is now reverse-alpha only:

- Doubao detect is reverse-alpha-consistent: match the bundled alpha glyph
  silhouette against the corner via TM_CCOEFF_NORMED (DETECT_NCC_THRESHOLD 0.4)
  -- keys on the '豆包AI生成' SHAPE, not coverage/structure heuristics. FP
  7/1243 (0.6%). Removes the cv2 inpaint path + the _glyph_structure gate.
- Registry is reverse-alpha only: dropped the cv2/LaMa backend (_glyph_remove,
  _lama_box_inpaint, default_backend, --backend) and the Samsung entry. Doubao
  outside the alpha resolution band is skipped, never inpainted.
- Removed samsung_engine.py + tests + --mark samsung (no alpha map captured;
  Samsung C2PA/genAIType metadata detection in identify is unaffected).
- The universal erase --region (cv2/LaMa) is unchanged -- arbitrary-region
  inpainting stays a user-directed tool, separate from the known-mark registry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(doubao): NCC sub-pixel alignment -> reverse-alpha at any resolution

A pure width-scale of the captured alpha map is only sub-pixel-accurate at the
captured width and leaves a faint ghost elsewhere. remove_watermark_reverse_alpha
now registers the alpha glyph to the actual mark via a TM_CCOEFF_NORMED
scale+position search (_aligned_alpha_map) before inverting the blend, so the
single 2048 capture works at any resolution -- verified clean on the 1773x2364
(3:4) corpus size, the biggest coverage gap (23 files).

reverse_alpha_available is now just 'asset present' (no width band); the registry
still gates removal on detect so a clean corner is never touched. Drops the
_ALPHA_WIDTH_TOLERANCE gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(doubao): keep native recovery exact -- fixed geometry at captured width

Integer-pixel NCC alignment landed ~1px off at the captured width, degrading the
otherwise-exact native reverse-alpha (synthetic recovery error 0.94 -> 1.39).
remove_watermark_reverse_alpha now uses exact width-relative geometry within
_ALPHA_NATIVE_BAND of the captured width and the NCC search only off it -- best
of both: native back to 0.94, other resolutions still aligned.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(doubao): harden alignment -- try fixed+aligned, keep least residual (56/56)

On a faint/busy-background mark the NCC alignment peak can wander a few px off
the true mark and leave a residual (2/56 real corpus files). Off the captured
width, remove_watermark_reverse_alpha now builds BOTH the fixed-geometry and the
NCC-aligned alpha map, applies each, and keeps whichever leaves the least
residual mark (re-detect confidence on the bare reverse-alpha) -- geometry wins
on faint marks, alignment on clear ones, no magic threshold. Real-file round-trip
now removes 56/56 detected Doubao clean across every corpus resolution (was 54).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(doubao): skip residual inpaint at native width for exact recovery

At the captured width the fixed-geometry reverse-alpha is pixel-exact, so
inpainting over it only replaced exactly-recovered interior pixels with a
cv2 hallucination -- measured worse on a textured background (native error
vs true bg 1.6 reverse-alpha-only vs 2.6 with the old always-on
full-footprint inpaint). Native now returns the bare recovery untouched;
off-native, where NCC alignment is only sub-pixel-approximate, the footprint
inpaint stays to clean the seam. Real round-trip still 56/56 across all
corpus resolutions; negatives 0/60, Gemini unaffected.

Add test_native_returns_exact_reverse_alpha_no_inpaint as the regression
guard. Sync CLAUDE.md + README (the table cell and prose described the
pre-NCC "skipped off native / cv2-LaMa" behavior, now stale). Gitignore the
session scheduled_tasks.lock, and add the text-protection research note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 19:49:09 -07:00
Victor Kuznetsov 223cbcf171 feat(metadata): detect China TC260 AIGC PNG chunk and HuggingFace hf-job-id
aigc_label now reads the TC260 label from a raw-JSON `AIGC` PNG tEXt chunk
(as Doubao/ByteDance write it, with no namespaced XMP marker) in addition to
the `<TC260:AIGC>` XMP block, via a shared _parse helper gated on a TC260 field
so a generic AIGC key cannot false-positive. New huggingface_job() reads the
hf-job-id PNG chunk; identify surfaces it as a medium-confidence hf_job signal
(parallel to the visible sparkle, never overriding a hard metadata verdict).
Both wired into has_ai_metadata/get_ai_metadata; the PNG save whitelist already
strips them on removal. Found by auditing 646 corpus originals: 28 AIGC and 3
hf-job files the library previously reported as Unknown.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 12:40:17 -07:00
test-user 18160fe269 feat(identify): integrity-clash detection for contradictory provenance (v0.6.7)
Surface contradictions between independent provenance signals instead of
collapsing to a single verdict -- a strong tell of spoofed, transplanted, or
laundered metadata. Inspired by arXiv:2603.02378.

Two rules in the new _integrity_clashes helper:
- Conflicting AI-origin attributions: two or more distinct AI vendors named by
  independent generator stamps (e.g. a C2PA OpenAI manifest on an image whose
  EXIF says Make="Ideogram AI").
- Camera + AI: a camera-capture C2PA device (Pixel/Leica/Sony/Nikon/Truepic)
  coexisting with an AI-generation marker -- a genuine capture is not AI.

High-precision by design: only hard generator stamps feed it (C2PA issuer when
the source is AI, SynthID proxy, EXIF/XMP generator, IPTC AISystemUsed, xAI,
AIGC). The fuzzy visible sparkle and the open invisible watermark are excluded
-- the latter can be a by-product of our own SDXL removal pass. Vendor
normalization (_vendor_of over _AI_VENDOR_TOKENS) keeps consistent signals from
clashing (C2PA "Google (Gemini)" + SynthID-Google agree); the C2PA vendor is
read from the issuer attribution, not the resolved platform, so a camera label
like "Google Pixel" cannot mis-normalize to an AI vendor.

Surfaced as ProvenanceReport.integrity_clashes (red in the table view, included
in --json). 19 new tests; all real single-origin fixtures (chatgpt/firefly/
doubao/grok/mj) verified to produce zero clashes (false-positive guard).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:27:25 -07:00
test-user e1c99b5937 fix(identify): gate C2PA issuer->generator attribution on AI source type (v0.6.5)
Prevents an unmapped C2PA device whose manifest incidentally contains a mapped
issuer substring (e.g. the "Adobe XMP" toolkit string in a Canon/Sony camera
capture) from being mislabeled as that AI generator ("Adobe Firefly").
_attribute_platform now names a specific AI-generator platform only when the
digital-source-type is trainedAlgorithmicMedia; otherwise it degrades to the
neutral "C2PA signer: X" label. Real Firefly/OpenAI/Google output carries the
AI source-type and is unaffected (verified: chatgpt-1.png->OpenAI,
firefly-1.png->Adobe Firefly still attribute). Closes the only real downside of
leaving Canon/Samsung/Bria device signers unmapped: detection and removal were
already unaffected; now the platform label degrades gracefully too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:29:12 -07:00
test-user 9f93d9c0c5 feat(identify): add Sony C2PA device attribution, verified (v0.6.3)
Adds Sony to _DEVICE_C2PA_PLATFORM, matching Sony's own `sony.sig` / `sony.cert`
C2PA assertion namespace (NOT bare "Sony", which is a common EXIF Make). Verified
against a real Sony-signed file (Sony PXW-Z300, signer "Sony Corporation") found
in the Security4Media/c2pa-video-player repo. The sample is video (MP4) -- our
ISOBMFF C2PA path detects it; Sony Alpha stills likely share the namespace.

Verified device set is now Leica, Nikon, Google Pixel, Sony, Truepic. Canon /
Samsung / Bria still have no public direct-download C2PA sample to verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 21:13:49 -07:00
test-user 64be9598f2 fix(identify): device-token-first C2PA attribution; add verified Pixel (v0.6.2)
Replaces the claim-generator-string match with a distinctive device-token scan
of the manifest bytes (_device_platform / _DEVICE_C2PA_PLATFORM), which is more
robust: it catches devices where the generator name lives under a non-standard
CBOR key (Pixel uses `claim_generator_info`, so it has no `claim_generator`).

- Adds Google Pixel, verified against a real Pixel 10 Pro C2PA file (attached to
  c2pa-rs issue #1609/#1554): cert CN "Pixel Camera", digitalSourceType
  `computationalCapture` -> capture authenticity, not AI (is_ai stays None).
- Token distinctiveness is load-bearing: bare "Truepic" matched the OpenAI
  chatgpt-1.png fixture (Truepic is a trust-chain signing authority), so the
  token is the specific "Truepic_Lens"; "Pixel Camera" (cert CN) not "Pixel".
- Verified Leica/Nikon/Truepic/Pixel attribute correctly and OpenAI/Adobe/MJ
  do not regress. Sony/Canon/Samsung/Bria stay unmapped: no public direct-
  download C2PA sample exists to verify their in-manifest string.
- Regression tests: device token beats incidental issuer mentions (Leica,
  Pixel-vs-Google).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 20:43:40 -07:00
test-user dda2ee7fbb fix(identify): attribute C2PA by claim_generator, not incidental issuer tokens (v0.6.1)
Verified on real signed files that the issuer byte-scan mis-attributes
multi-entity manifests: Leica read as "Truepic" (timestamp authority in the
chain), Nikon as "Adobe Firefly" (XMP-toolkit "Adobe" + the sample's
"Adobe_MAX" name), Truepic as "Google". Platform attribution now prefers the
claim generator (what produced the asset) and falls back to the issuer scan.

- New _CLAIM_GENERATOR_PLATFORM map + _platform_from_generator; claim generator
  read for non-PNG via the now-public c2pa.cbor_text_after.
- Device tokens listed only where verified against a real C2PA file (Leica
  lc_c2pa, Nikon, Truepic Lens); Pixel/Samsung/Sony/Canon/Bria deferred until a
  real sample confirms the in-manifest string. Camera C2PA marks capture
  authenticity, so these never set is_ai.
- cbor_text_after made public (was _cbor_text_after); call sites + tests updated.
- Regression test: claim_generator beats incidental Adobe/Google/Truepic tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 20:10:07 -07:00
test-user c196a16900 feat: detect soft-binding vendors, IPTC 2025.1, video/audio C2PA, TrustMark (v0.6.0)
Broadens metadata provenance coverage at the detection and container-strip level.

Detection:
- C2PA soft-binding `alg` -> forensic-watermark vendor (Adobe TrustMark,
  Digimarc, Imatag, Steg.AI, Microsoft, ...) via C2PA_SOFT_BINDINGS +
  soft_binding_vendors_in(); names the watermark vendor even when the watermark
  itself can't be decoded.
- IPTC Photo Metadata 2025.1 AI-disclosure XMP fields (AISystemUsed etc.) via
  iptc_ai_system() + IPTC_AI_FIELD_MARKERS.
- Adobe TrustMark open keyless decoder (trustmark_detector.py, optional extra
  `trustmark`) -- the watermark behind Adobe Durable Content Credentials.
  Detects provenance, not AI origin, so it does not assert is_ai.

Removal / containers:
- isobmff.strip_c2pa_boxes now also drops a top-level XMP uuid box that carries
  an AI label (matched by AI-marker content, byte-order-robust; plain XMP kept).
- remove_ai_metadata routes MP4/MOV/M4V/M4A (and any ftyp-sniffed ISOBMFF)
  through the box stripper; raises a clear error for non-ISOBMFF audio/video
  (WebM/MP3/WAV) instead of crashing in the image path.

Tests: soft-binding scan, IPTC element/attribute/presence, MP4 + M4A detect/
strip, ISOBMFF XMP surgical strip, content-sniff, unsupported-container guard,
TrustMark absent-safety + identify integration. ruff clean; pyright clean on
all new modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 17:56:48 -07:00
test-user 74618b91a7 feat: detect xAI/Grok EXIF signature; refresh watermarking landscape (v0.5.5)
xAI Grok (Aurora) images carry no C2PA/SynthID/IPTC -- their only provenance
signal is an EXIF pair: ImageDescription "Signature: <base64>" + a UUID Artist.
Verified stable across 3 genuine generations (a real download previously read
as unknown / "no AI metadata").

- metadata.xai_signature(): matches the Signature blob + UUID Artist pair;
  wired into has_ai_metadata, get_ai_metadata, and identify (platform
  "xAI (Grok / Aurora)").
- data/samples/grok-1.jpg: real Grok fixture (neutral content; the Artist UUID
  is the public image id, not PII).
- Tests: synthetic-fixture unit tests, real-sample assertion, identify
  integration (322 passing).

Docs (research refresh, May 2026):
- C2PA 2.4 Durable Content Credentials (soft-binding re-discovery after the
  embedded manifest is stripped).
- New AI-labeling laws, primary-source verified: EU AI Act Art 50 (2026-08-02),
  South Korea AI Framework Act Art 31(3), California AB 853.
- Hedge removal claims: defeating the SynthID verifier is not forensic
  invisibility (arXiv:2605.09203); cite SynthID-Image (arXiv:2510.09263).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 14:14:35 -07:00
test-user 03fb460f77 Track the labeled SynthID corpus; complete metadata-source test coverage
Corpus images were gitignored (local-only). The negatives were reviewed and
cleared for publishing, so the labeled set is now committed (regular git, 65 MB
across 25 files) -- making the removal regression set reproducible and CI-able.

Corpus:
- Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the
  synthetic refs/ calibration fills gitignored.
- Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows
  for removed images pruned; dedup left one cleaned output, f6dd47a5).
- Rewrite the corpus README layout/policy (images committed; review every image
  for private content before adding -- public repo, permanent history).

Test fixtures:
- Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF).
- Add the clean_photo conftest fixture serving a verified-negative image from
  the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it
  (skips if the corpus is absent).

Metadata-source coverage (close the last sub-variant gaps):
- c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and
  compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy).
- exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP
  CreatorTool were already covered).

All 8 metadata-source kinds are now tested at both the unit and identify()
level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:46:47 -07:00
test-user c7f0d71f90 feat(identify): detect China TC260 AIGC label (Doubao et al.)
China-served generators embed an XMP <TC260:AIGC>{"Label":"1",...} block
(China's mandatory AI-content labeling, TC260 standard). Doubao (ByteDance)
uses it -- verified on the real #13 sample. It's none of C2PA / SynthID /
imwatermark / IPTC, so identify() previously returned unknown.

- metadata: AIGC_MARKERS + aigc_label() (json-decodes the HTML-entity-encoded
  block); has_ai_metadata + get_ai_metadata now surface it.
- identify: new 'aigc' signal -> is_ai True, platform 'China AIGC-labeled
  generator (TC260; e.g. Doubao)', carries the ContentProducer code.
- Container-agnostic raw-byte scan, so it covers the whole China-AIGC ecosystem
  (Jimeng/Kling/Qwen/Ernie share the standard).
- Tests: synthetic TC260 block (metadata + identify). Docs updated.

Addresses #13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:29:51 -07:00
test-user ad3b8ee248 feat(identify): read EXIF Software / XMP CreatorTool generator tags
Closes the documented gap where EXIF/XMP fields inside AVIF/HEIF/JXL went
unparsed. metadata.exif_generator extracts the EXIF Software/Artist tag
(via PIL+piexif, which opens AVIF natively) and the XMP CreatorTool (via a
container-agnostic raw-byte scan that also covers HEIF/JXL that PIL can't
open), and matches against AI_GENERATOR_TOKENS so only generator names
(Firefly, DALL-E, Midjourney, ComfyUI, ...) fire -- a plain 'Adobe
Photoshop' or 'GIMP' tag is not flagged.

identify() surfaces it as a high-confidence signal and uses it for
platform attribution when no C2PA names a platform, so an AVIF/HEIF whose
only AI signal is an EXIF/XMP generator tag is now caught.

Validated with synthesized fixtures (the 'no positive fixtures' blocker
was self-imposed): real AVIF and JPEG written with EXIF Software via PIL,
plus an XMP CreatorTool raw-scan fixture. Zero false positives across the
109-image corpus (real iPhone photos carry no AI generator token).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 17:56:39 -07:00
test-user 3a1c5427c8 feat(c2pa): recognize Stability AI issuer; fix Microsoft platform label
Collected live C2PA positives from Bing Image Creator and Stability Brand
Studio (DreamStudio successor) and learned two things our scan got wrong:

- Bing now runs Microsoft's own MAI-Image model, not DALL-E, and signs
  C2PA as 'Microsoft'. The scan caught it, but the platform label claimed
  'Microsoft Designer (DALL-E / OpenAI backend)'. Relabeled model-neutral:
  'Microsoft (Bing Image Creator / Designer)'.
- Stability signs C2PA as 'Stability AI' (cert 'Stability AI Ltd'), which
  was not in C2PA_ISSUERS, so it read as 'unknown signer'. Added the issuer
  and a platform mapping. Stability uses no SynthID and (on its current
  Stable Image model) no imwatermark watermark -- verified, both negative.

Both ingested as SynthID-negative corpus fixtures (they are AI but not
SynthID) for issuer-coverage. Canva skipped: its downloads are re-encoded
design exports that strip C2PA, so a Canva sample would be inconclusive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 17:12:42 -07:00
test-user 27ad5b7645 feat(identify): detect open SD/SDXL/FLUX invisible watermark
Research found one locally-fillable detection gap: Stable Diffusion, SDXL,
and FLUX all embed an open DWT-DCT watermark via the invisible-watermark
(imwatermark) library -- a PUBLIC decoder, no secret key, unlike SynthID.
New invisible_watermark.py decodes the known fixed patterns (verified
against upstream source: diffusers SDXL WATERMARK_MESSAGE, FLUX.2
src/flux2/watermark.py, and the 'StableDiffusionV1' default string) and
identify() reports the scheme as a high-confidence signal.

Verified locally end-to-end: embedding SDXL's exact 48-bit message and
decoding it back recovers 48/48 bits; a clean image and our own fal-SDXL
outputs decode to ~21/48 (no match). Caveat baked into the report: the
watermark is fragile -- gone after JPEG q90 -- so it confirms origin only
on pristine files; absence is never proof.

imwatermark is an optional dep (extra 'detect'; pulls non-headless opencv),
so the import is guarded and the signal is skipped when absent. CLI
--no-visible now means metadata-only (skips both pixel-domain detectors).

Also records the broader watermarking landscape in CLAUDE.md: which
services are locally detectable (SD/SDXL/FLUX), C2PA-covered (Bing/Canva/
Getty/Shutterstock unsampled), or proprietary-only like SynthID (Amazon
Titan/Nova, Kakao). Midjourney embeds neither C2PA nor an invisible mark.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 16:53:59 -07:00
test-user 1a9f3e4fe5 test(identify): cover provenance branches, CLI, sparkle helper
Adds 20 tests around the new provenance path:

- identify(): local SD/ComfyUI params -> local-pipeline attribution;
  visible-sparkle gating at the 0.5 threshold (mocked detector: above,
  below, unavailable, opt-out); metadata verdict not downgraded by a
  sparkle hit; OpenAI/SynthID caveats + dedup; ProvenanceReport is
  JSON-serializable (the CLI --json path); and the honest edge where a
  C2PA manifest without an AI source marker stays 'unknown'.
- CLI 'identify': help, clean PNG, AI PNG platform, valid --json,
  missing file.
- gemini_engine.detect_sparkle_confidence: float in range for a real
  image, None for an unreadable file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 16:27:00 -07:00
test-user fa104bcade feat(identify): provenance command (platform + watermark inventory)
New 'identify' command and identify.py module: upload an image, get one
ProvenanceReport answering where it was made and what watermarks it
carries. Aggregates every locally-readable signal:

- C2PA Content Credentials -> generating platform (issuer + generator).
- IPTC digitalSourceType 'Made with AI' (Meta and others).
- Embedded SD/ComfyUI generation parameters (local pipelines).
- SynthID metadata proxy (Google / OpenAI C2PA companion).
- Visible Gemini sparkle (cv2 fallback for the stripped-metadata case),
  promoted only at confidence >= 0.5 (corpus-tuned: Gemini sparkles
  score >= 0.56, non-sparkle <= 0.49).

is_ai_generated is True or None, never asserted False -- stripped
metadata leaves no local proof of a clean origin, so absence of signals
is reported as 'unknown' with an explicit caveat. The SynthID *pixel*
watermark remains locally undecodable; the report says so.

Non-PNG containers (JPEG/WebP/AVIF/HEIF/JXL) get the same issuer +
generator attribution via a binary scan (the caBX parser is PNG-only).
The cv2 dependency is isolated in gemini_engine.detect_sparkle_confidence
so identify.py stays type-clean. CLI supports --json and --no-visible.

Validated against the 109-image corpus: 14/14 positives flagged AI,
93/94 negatives clean (the one 'neg' flagged is a Meta image that
genuinely carries the IPTC tag -- correct), zero true errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 16:19:26 -07:00