docs: neutralize local-pull path reference in doubao research note

Replace the `data/spaces/originals/` path with a generic "local corpus of
pristine originals" so the committed public doc carries no reference to the
local working-data pull (the data itself is gitignored). The analysis scripts'
default paths are left untouched (operational tooling, no content/provenance).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Victor Kuznetsov
2026-06-22 09:41:23 -07:00
parent 719c7e9313
commit 78e2ae65ad
+1 -1
View File
@@ -10,7 +10,7 @@
**Conclusion (historical): pure reverse-alpha distilled from content images does NOT work, and the blocker is the WRONG kind of data, not too little of it.**
The earlier framing ("need ~5-8 PRISTINE same-resolution originals") is obsolete -- `data/spaces/originals/` holds plenty. Curate them with `DoubaoEngine.detect` + an NCC filter against a clean glyph template, keeping only marks at offset ≈ (0,0): that yields e.g. **15 pixel-aligned 2048² marks** (sub-pixel drift, not the ±50 px the old lossy/mixed-res scrapes had), plus 1086x1448 / 1792x2400 clusters. With those, LaMa-clean `O` + weighted-LS (and per-pixel I-on-O regression) for `α` (+ logo colour) was tried end-to-end and **still leaves a persistent ghost outline.**
The earlier framing ("need ~5-8 PRISTINE same-resolution originals") is obsolete -- a local corpus of pristine originals holds plenty. Curate them with `DoubaoEngine.detect` + an NCC filter against a clean glyph template, keeping only marks at offset ≈ (0,0): that yields e.g. **15 pixel-aligned 2048² marks** (sub-pixel drift, not the ±50 px the old lossy/mixed-res scrapes had), plus 1086x1448 / 1792x2400 clusters. With those, LaMa-clean `O` + weighted-LS (and per-pixel I-on-O regression) for `α` (+ logo colour) was tried end-to-end and **still leaves a persistent ghost outline.**
Diagnosed why, empirically (cached stacks, `/tmp/doubao_distill`): (1) the mark is a clean white overlay with **no dark halo** -- over glyph pixels ~54% are brighter than the clean bg, only ~4% darker -- so the white-logo model `I=(1-α)O+α·255` is correct; (2) but content backgrounds are almost never dark *under* the mark (median darkest available bg over glyph pixels = **58/255**; only ~13% of mark pixels are ever observed on a bg < 40), so on bright backgrounds the equation is ill-conditioned and `α` is unidentifiable; (3) LaMa's `O` is a plausible **hallucination**, not the true pre-mark background, which compounds the error, and per-pixel regression on ~15 obs overfits into colour noise.