remove-ai-watermarks

mirror of https://github.com/wiltodelta/remove-ai-watermarks.git synced 2026-07-25 00:50:48 +02:00

Author	SHA1	Message	Date
Victor Kuznetsov	8523f48fb6	data(corpus): archive June 2026 SynthID strength-study subjects Back docs/synthid.md section 2.2 with the actual test set: the per-image oracle-verified subjects were only in a local working dir, while the doc claimed they were recorded in data/synthid_corpus/. Ingest the key pos+cleaned pairs so the claim holds. - pos: openai_1/2/3 originals (gpt-image, openai-verify) + gemini_1/2/3/4 originals (Gemini app, gemini-app); all probe as C2PA-SynthID present. - cleaned: OpenAI at strength 0.05 (openai_2 only s010 captured) + Gemini at 0.15 --max-resolution 1536; oracle: SynthID NOT detected. Metadata stripped, so no C2PA on the cleaned rows. - Excluded the third-party issue #14 image (pic3): oracle-verified but not committed to the public corpus. - docs/synthid.md 2.2: state OpenAI n=4 = 3 archived + 1 external-only. - CLAUDE.md: drop the drift-prone "~65 MB" corpus size from the sdist note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 17:09:58 -07:00
Victor Kuznetsov	e572767555	feat(visible): add Jimeng remover, fix Doubao outline defect, reproducible mask build Visible-watermark work across all three corner-mark engines plus a committed, reproducible alpha-build pipeline (scripts/visible_alpha_solve.py) fed by committed solid black/gray/white captures. - jimeng: new "即梦AI" wordmark remover (reverse-alpha + thin residual inpaint, always NCC-aligned -- the mark re-rasterizes/jitters per image). Detect via glyph silhouette NCC (0.45 threshold; does not cross-fire with Doubao). Registered in the visible-mark catalog; `visible --mark jimeng` / `--mark auto`. - doubao: fix a real production defect -- the shipped remover left a READABLE "豆包AI生成" outline on real samples while detect() returned conf 0.0 (fooled by a thin outline), so the test passed and the "56/56 clean" claim was detector-measured, not visual. Root cause: under-estimated alpha + fixed-geometry-no-inpaint + tight locate box. Rebuilt alpha (careful gray-self solve), always-align, thin inpaint, widened locate box -> readable outline becomes faint texture-level traces. - gemini: rebuild gemini_bg_{96,48} from our own controlled captures (validated NCC 0.9998 vs the prior third-party asset); removal re-verified clean, no behaviour change. - tests: add textured-shift regression to both engines (guards the align-on-shift path the Doubao defect exposed; lesson: a detector-only removal test is insufficient, assert visual residual). - docs: CLAUDE.md, README, capture READMEs and docstrings synced; stale "exact/pixel-exact/56-clean" claims removed. Also includes a SynthID label-wording clarification in identify.py/cli.py ("SynthID pixel watermark" -> "SynthID watermark, inferred from C2PA metadata"). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 12:20:19 -07:00
Victor Kuznetsov	5d0e6c3a65	fix: harden metadata parsers and engines; sync docs (full-repo review) Apply fixes from a full-repo review (code, tests, docs). Security / correctness: - Clamp attacker-controlled PNG/caBX chunk lengths to the remaining file size in metadata.py and noai/c2pa.py (a malformed length no longer drives a multi-GB read); skipped chunks seek instead of read. - noai/isobmff.strip_c2pa_boxes is now fail-safe on a malformed box: return the original bytes with a warning instead of silently truncating the tail, so metadata --remove can no longer emit a corrupt file. - doubao_engine._fixed_alpha_map clamps the glyph box to the image (no crash on degenerate width-vs-height). - watermark_remover._run_region_hires gates the phaseCorrelate offset on response and magnitude (a spurious shift no longer garbles text) and drops the generator after a CPU fallback (no MPS/CPU device mismatch). Robustness: - gemini_engine, doubao_engine, region_eraser normalize grayscale and RGBA inputs to BGR at the engine entry points. - image_io.imwrite returns False on an unwritable path (matches cv2). - invisible_engine guards a None imread result before use. - trustmark_detector._decoder uses a double-checked threading lock. - ctrlregen.tiling.tile_positions raises on overlap >= tile. - humanizer chromatic shift no longer wraps opposite-edge pixels. - identify OpenAI caveat keyed on the normalized vendor, not a substring. - Remove the dead "visible --detect-threshold" CLI option. - publish.yml verifies the release tag matches the package version. Docs: - README strength 0.05 to 0.10; .env.example HF_TOKEN marked optional; doubao_capture README updated to reverse-alpha-only; CLAUDE.md synced with the new behaviors and the batch command. Tests: new test_security_clamp.py for the read clamp and isobmff fail-safe; erase CLI coverage; integrity-clash rule 2 end-to-end; multi-tag EXIF survival and cross-format strip guards; channel/size, tiling, humanizer, and imwrite regressions. Full suite 493 passed, 2 skipped; ruff and pyright src/ clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 18:00:39 -07:00
test-user	bc3228d387	feat(visible): Doubao text-mark removal + universal region eraser Add deterministic, CPU-only removal of the visible Doubao "豆包AI生成" mark and a position-agnostic region eraser for any other visible watermark/logo. - doubao_engine.py: locate (geometry, scales with width) + polarity-aware white-top-hat glyph mask + cv2 inpaint; coverage-gated detection and a dense-text safety guard. No GPU, ~30ms. - region_eraser.py + `erase` command: inpaint arbitrary --region box(es). Default cv2 backend (no deps); optional big-LaMa via onnxruntime (`lama` extra, Carve/LaMa-ONNX, model downloaded on first use, never bundled). - cli `visible --mark auto|gemini|doubao`: auto routes by detector confidence. - tests for both engines; seed previously-unseeded CLI image fixtures to stop the Doubao detector flaking on random corners. - .gitignore: doubao_capture/{seeds,captures} scratch (alpha-map calibration). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 21:31:51 -07:00
test-user	74618b91a7	feat: detect xAI/Grok EXIF signature; refresh watermarking landscape (v0.5.5) xAI Grok (Aurora) images carry no C2PA/SynthID/IPTC -- their only provenance signal is an EXIF pair: ImageDescription "Signature: <base64>" + a UUID Artist. Verified stable across 3 genuine generations (a real download previously read as unknown / "no AI metadata"). - metadata.xai_signature(): matches the Signature blob + UUID Artist pair; wired into has_ai_metadata, get_ai_metadata, and identify (platform "xAI (Grok / Aurora)"). - data/samples/grok-1.jpg: real Grok fixture (neutral content; the Artist UUID is the public image id, not PII). - Tests: synthetic-fixture unit tests, real-sample assertion, identify integration (322 passing). Docs (research refresh, May 2026): - C2PA 2.4 Durable Content Credentials (soft-binding re-discovery after the embedded manifest is stripped). - New AI-labeling laws, primary-source verified: EU AI Act Art 50 (2026-08-02), South Korea AI Framework Act Art 31(3), California AB 853. - Hedge removal claims: defeating the SynthID verifier is not forensic invisibility (arXiv:2605.09203); cite SynthID-Image (arXiv:2510.09263). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 14:14:35 -07:00
test-user	03fb460f77	Track the labeled SynthID corpus; complete metadata-source test coverage Corpus images were gitignored (local-only). The negatives were reviewed and cleared for publishing, so the labeled set is now committed (regular git, 65 MB across 25 files) -- making the removal regression set reproducible and CI-able. Corpus: - Track data/synthid_corpus/images/ (pos 9, neg 15, cleaned 1); keep only the synthetic refs/ calibration fills gitignored. - Reconcile manifest.csv to the on-disk files: 117 -> 25 rows (92 dangling rows for removed images pruned; dedup left one cleaned output, f6dd47a5). - Rewrite the corpus README layout/policy (images committed; review every image for private content before adding -- public repo, permanent history). Test fixtures: - Remove data/samples/not-ai-1/2/3 (personal iPhone photos, incl. GPS EXIF). - Add the clean_photo conftest fixture serving a verified-negative image from the corpus neg/ set; repoint the three "non-AI / clean photo" tests onto it (skips if the corpus is absent). Metadata-source coverage (close the last sub-variant gaps): - c2pa digitalSourceType: algorithmicMedia (procedural, not flagged AI) and compositeWithTrainedAlgorithmicMedia (AI + SynthID proxy). - exif_generator: EXIF Artist and ImageDescription fields (Software/Make/XMP CreatorTool were already covered). All 8 metadata-source kinds are now tested at both the unit and identify() level. 313 tests pass. CLAUDE.md updated (corpus tracked, clean_photo fixture). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:46:47 -07:00
test-user	59d72c5db7	Record verified gpt-image-2 SynthID-cleaned chain in corpus Add manifest row for the 4ef377bd -> f6dd47a5 chain: a gpt-image-2 sample (openai.com/verify: SynthID + C2PA detected) cleaned via v0.5.3 `all` at native 1254x1254 (prod-equivalent SDXL base, strength 0.05, 50 steps). openai.com/verify reports SynthID NOT detected after the run, re-confirming that the #10 native-resolution default defeats OpenAI SynthID and resolving the #15 root cause (older SD-1.5/768px downscale default did not). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 13:55:10 -07:00
test-user	1afc1e60ef	test(samples): add real Doubao TC260 AIGC reference sample 2048x2048 PNG carrying China's TC260 <TC260:AIGC> label; identify reports it as a China AIGC-labeled generator (TC260). Reference fixture for manual re-verification of the TC260 detection path -- the automated tests use synthetic blobs, so nothing depends on this file being present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:36:28 -07:00
test-user	768d997ef0	docs: scope SynthID provenance claims to source-verified facts Threat model: replace the unverified deployment list (Gemini 3 Pro / Nano Banana Pro / Imagen 4 / Veo) with the source-verified scope -- SynthID across Imagen / Veo / Lyria plus Gemini app outputs (>10B items by Dec 2025), and attribute the 136-bit payload to the paper's SynthID-O variant. openai-images-2 sample: note the file predates the 19 May 2026 SynthID rollout across ChatGPT / Codex / API, and that openai.com/verify is now the public oracle (still no local decoder). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:18:13 -07:00
test-user	fb42295b3a	docs: record verified fal fast-sdxl checkpoint + native-resolution updates - fal's llms.txt confirms fast-sdxl is stabilityai/stable-diffusion-xl-base-1.0, the exact checkpoint the local CLI defaults to -> local == prod weights. Recorded in CLAUDE.md and README. - README How it works + sample README: replace the old downscale->upscale description with native-resolution processing (matches the #10 fix); document --max-resolution as an opt-in OOM cap. - README roadmap: idna already bumped (uv-secure clean). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 09:57:03 -07:00
test-user	93c664f7fb	docs: sync README + corpus map with v0.5.x detection coverage - README Features: add the identify / provenance-detection capability. - README Supported models: add FLUX, Stability AI, Microsoft/Bing (MAI-Image), Meta AI rows; note SD/SDXL/FLUX imwatermark is locally detectable; add a detection note pointing at identify. - corpus README per-platform map: add Stability / Ideogram / Recraft / Krea-FLUX rows + an imwatermark column; correct Bing (MAI-Image, signs 'Microsoft'); note imwatermark fires only on pristine pipeline output, not re-hosts/exports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 09:32:03 -07:00
test-user	ede35a3db5	feat(metadata): read EXIF Make tag; collect Ideogram/Recraft/Krea-FLUX Collected live samples from three popular generators we lacked: - Ideogram tags its downloads with EXIF Make="Ideogram AI" (no C2PA, no SynthID, no imwatermark) -- the Make tag is its only signal. exif_generator only read Software/Artist/ImageDescription, so it missed this; now reads Make too. Real cameras put "Apple"/"Canon" in Make (no AI token), so this stays low-false-positive. 4 originals ingested. - Recraft (PNG export) and Krea hosting FLUX 2: downloads carry NO detectable signal -- no C2PA/EXIF/IPTC, and notably no imwatermark despite Krea running FLUX. identify correctly reports 'unknown'. Both ingested as neg fixtures. Lesson recorded in CLAUDE.md: the imwatermark detector fires only on pristine output from a pipeline that runs the encoder (diffusers default, official BFL), not from re-hosts (Krea/Stability) or re-encoded exports (Recraft/Canva). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 18:38:56 -07:00
test-user	3a1c5427c8	feat(c2pa): recognize Stability AI issuer; fix Microsoft platform label Collected live C2PA positives from Bing Image Creator and Stability Brand Studio (DreamStudio successor) and learned two things our scan got wrong: - Bing now runs Microsoft's own MAI-Image model, not DALL-E, and signs C2PA as 'Microsoft'. The scan caught it, but the platform label claimed 'Microsoft Designer (DALL-E / OpenAI backend)'. Relabeled model-neutral: 'Microsoft (Bing Image Creator / Designer)'. - Stability signs C2PA as 'Stability AI' (cert 'Stability AI Ltd'), which was not in C2PA_ISSUERS, so it read as 'unknown signer'. Added the issuer and a platform mapping. Stability uses no SynthID and (on its current Stable Image model) no imwatermark watermark -- verified, both negative. Both ingested as SynthID-negative corpus fixtures (they are AI but not SynthID) for issuer-coverage. Canva skipped: its downloads are re-encoded design exports that strip C2PA, so a Canva sample would be inconclusive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 17:12:42 -07:00
test-user	af787fd8d6	docs(corpus): per-platform watermark map + surface-dependent blind spot Grow the SynthID corpus to 109 originals (91 iPhone-photo negatives, 2 positives) and document what was learned studying 8 platforms: - README: per-platform watermark map (C2PA issuer / SynthID pixel / IPTC / visible sparkle per platform) and an "originals, not previews" note (re-encoded previews strip metadata, so a clean preview is not proof). - CLAUDE.md: surface-dependent blind spot -- the same Google model wraps C2PA in the Gemini app but emits the SynthID pixel watermark + sparkle with no C2PA/IPTC via the API/playground (AI Studio, Nano Banana), so synthid_source returns None despite SynthID being present; only the pixel oracle or the visible-sparkle detector catches those. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 15:55:17 -07:00
test-user	da0edcbddc	chore(corpus): grow SynthID reference set + document autonomous Chrome collection Adds content positives (OpenAI gpt-image: forest, fisherman, tokyo; Google gemini: fisherman, mug) and SDXL/non-SynthID negatives to the local corpus manifest. Now spans 4 resolutions across 2 vendors (was solid-black only). README: documents driving generation via Chrome MCP -- Gemini single-click download; ChatGPT via in-page fetch+blob (preserves original C2PA bytes, unlike the flaky UI download / a canvas re-encode). Images stay gitignored; only the manifest (sha256 + labels + extracted metadata) and protocol are tracked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:46:46 -07:00
test-user	f07ce10c72	feat(metadata): SynthID-source detection, C2PA parser consolidation, corpus + tests Detect SynthID-bearing images via their C2PA companion: a manifest signed by a SynthID-using vendor (Google/OpenAI) on AI-generated content implies an invisible SynthID pixel watermark. Verified end-to-end against the vendor oracles (openai.com/verify, Gemini "Verify with SynthID"). - metadata: synthid_source() + synthid_watermark verdict in get_ai_metadata, surfaced as a `metadata --check` callout. Format-agnostic (PNG caBX parser + JPEG/WebP/AVIF/HEIF/JXL binary scan). - constants: SYNTHID_C2PA_ISSUERS {Google, OpenAI}; +opened/placed actions. - c2pa: single CBOR-aware parser (_cbor_text_after) replaces glitchy regex (fixes fGPT-4o claim_generator); removed duplicate _scan_png_c2pa_chunk from metadata; shared synthid_verdict / synthid_vendors_in helpers. - corpus: scripts/synthid_corpus.py ingest tool + data/synthid_corpus/ (manifest tracked, images gitignored) for a labeled reference set. - tests: +38 across C2PA parser internals, extract/inject round-trip, ISOBMFF container stripping, all IPTC AI markers, and invisible watermark strength tiers (SynthID/StableSignature/TreeRing/StegaStamp/RingID/RivaGAN/...). Pixel-level SynthID detection remains out of reach locally (Google's decoder is proprietary); a from-scratch spectral pilot confirmed it does not separate real content. See CLAUDE.md for the full evaluation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 11:32:46 -07:00
test-user	87d02126e3	feat(metadata): parse C2PA JUMBF manifest fields, add Images 2.0 sample, bump to 0.3.4 - metadata --check now shows claim_generator, c2pa_spec, digital_source_type, c2pa_actions, signer instead of empty table for C2PA-only files - reuses existing extract_c2pa_chunk() from noai/c2pa.py — no more duplicate PNG chunk parsing or full-file reads - adds data/samples/openai-images-2/amur-leopard.png: real gpt-image-2 output with C2PA manifest signed by OpenAI OpCo LLC / Trufo CA (spec 2.2.0) - removes stale data/samples/nano-banana-1/2.png (no longer referenced) - updates README: new Images 2.0 row in supported models table - documents known text-degradation limitation in CLAUDE.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 17:21:51 -07:00
test-user	1890848ec3	SEO-optimized README, add sample images from multiple AI models - Rewrite README for SEO: Nano Banana, SynthID, Made with AI, C2PA keywords - Add Supported Models table with 7 AI services - Add 'Made with AI' label removal to features - Rename sections for search discoverability - Add samples: ChatGPT/DALL-E, Midjourney, Adobe Firefly - Reorganize data/samples with flat structure and clear naming	2026-03-25 17:23:24 -07:00
test-user	e5d8970add	Add project files, tests, and documentation for GitHub release - CLI with visible, invisible, all, metadata, and batch commands - Gemini watermark removal via reverse alpha blending - Invisible watermark removal via diffusion regeneration (SynthID, TreeRing) - AI metadata stripping (EXIF, PNG text, C2PA) - Face protection (YOLO/Haar) and analog humanizer - 137 tests covering all CLI modes and core engines - Ruff and Pyright clean	2026-03-25 11:15:05 -07:00
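
Several commits above (e5d8970add, e572767555) mention removing a visible mark via "reverse alpha blending". The log does not show the repository's code, so the following is only a minimal per-channel sketch of the general idea, assuming the mark was composited as watermarked = alpha * logo + (1 - alpha) * original. The function names `blend` and `reverse_alpha_blend` and the `max_alpha` guard are hypothetical, not taken from the project.

```python
def blend(original, logo, alpha):
    """Forward model (assumed): composite a logo value over one channel value."""
    return alpha * logo + (1.0 - alpha) * original


def reverse_alpha_blend(watermarked, logo, alpha, max_alpha=0.99):
    """Invert the forward model for one channel value in the 0..255 range.

    max_alpha is a hypothetical guard: as alpha approaches 1 the original
    value is no longer recoverable and the division would blow up.
    """
    a = min(max(alpha, 0.0), max_alpha)
    original = (watermarked - a * logo) / (1.0 - a)
    # Clamp to the valid range, since noise or a wrong alpha can overshoot.
    return min(max(original, 0.0), 255.0)


if __name__ == "__main__":
    marked = blend(100.0, 255.0, 0.3)
    print(reverse_alpha_blend(marked, 255.0, 0.3))
```

The inversion is only as good as the alpha map and the logo colour fed into it, which is consistent with the later commits that rebuild the alpha estimate and add alignment and a thin inpaint for the residual.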

19 Commits
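
Commit 5d0e6c3a65 describes clamping attacker-controlled PNG chunk lengths to the remaining file size and seeking past skipped chunks instead of reading them. The log does not include the actual code from metadata.py or noai/c2pa.py, so the sketch below is an illustration of that technique under stated assumptions: `iter_png_chunks` and its `wanted` parameter are hypothetical names, and CRC values are skipped rather than verified.

```python
import io
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_png_chunks(stream, wanted):
    """Yield (type, data) for chunk types in `wanted`; seek past all others.

    `stream` is a seekable binary stream. Hypothetical helper, not the
    project's API.
    """
    stream.seek(0, io.SEEK_END)
    file_size = stream.tell()
    stream.seek(0)
    if stream.read(8) != PNG_SIGNATURE:
        raise ValueError("not a PNG stream")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # truncated file or clean end of data
        declared, ctype = struct.unpack(">I4s", header)
        remaining = max(file_size - stream.tell(), 0)
        # The declared length comes from the file, so never trust it beyond
        # what is actually left to read.
        length = min(declared, remaining)
        if ctype in wanted:
            yield ctype, stream.read(length)
        else:
            stream.seek(length, io.SEEK_CUR)  # skip without allocating
        stream.seek(4, io.SEEK_CUR)  # skip the CRC (not verified here)
        if ctype == b"IEND":
            return


if __name__ == "__main__":
    # A chunk that claims ~4 GiB of data but only carries 3 bytes.
    bad = PNG_SIGNATURE + struct.pack(">I4s", 0xFFFFFFFF, b"caBX") + b"abc"
    print(list(iter_png_chunks(io.BytesIO(bad), {b"caBX"})))
```

With the clamp in place, a malformed length yields at most the bytes that really exist in the file, and unwanted chunks cost a seek rather than a read.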