From 78e2ae65ad8f263d6dac7cb5d75c5afc558ab37f Mon Sep 17 00:00:00 2001
From: Victor Kuznetsov
Date: Mon, 22 Jun 2026 09:41:23 -0700
Subject: [PATCH] docs: neutralize local-pull path reference in doubao research note

Replace the `data/spaces/originals/` path with a generic "local corpus of
pristine originals" so the committed public doc carries no reference to
the local working-data pull (the data itself is gitignored). The analysis
scripts' default paths are left untouched (operational tooling, no
content/provenance).

Co-Authored-By: Claude Opus 4.8
---
 docs/research-doubao-distillation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/research-doubao-distillation.md b/docs/research-doubao-distillation.md
index 925072b..1932a63 100644
--- a/docs/research-doubao-distillation.md
+++ b/docs/research-doubao-distillation.md
@@ -10,7 +10,7 @@
 
 **Conclusion (historical): pure reverse-alpha distilled from content images does NOT work, and the blocker is the WRONG kind of data, not too little of it.**
 
-The earlier framing ("need ~5-8 PRISTINE same-resolution originals") is obsolete -- `data/spaces/originals/` holds plenty. Curate them with `DoubaoEngine.detect` + an NCC filter against a clean glyph template, keeping only marks at offset ≈ (0,0): that yields e.g. **15 pixel-aligned 2048² marks** (sub-pixel drift, not the ±50 px the old lossy/mixed-res scrapes had), plus 1086x1448 / 1792x2400 clusters. With those, LaMa-clean `O` + weighted-LS (and per-pixel I-on-O regression) for `α` (+ logo colour) was tried end-to-end and **still leaves a persistent ghost outline.**
+The earlier framing ("need ~5-8 PRISTINE same-resolution originals") is obsolete -- a local corpus of pristine originals holds plenty. Curate them with `DoubaoEngine.detect` + an NCC filter against a clean glyph template, keeping only marks at offset ≈ (0,0): that yields e.g. **15 pixel-aligned 2048² marks** (sub-pixel drift, not the ±50 px the old lossy/mixed-res scrapes had), plus 1086x1448 / 1792x2400 clusters. With those, LaMa-clean `O` + weighted-LS (and per-pixel I-on-O regression) for `α` (+ logo colour) was tried end-to-end and **still leaves a persistent ghost outline.**
 
 Diagnosed why, empirically (cached stacks, `/tmp/doubao_distill`): (1) the mark is a clean white overlay with **no dark halo** -- over glyph pixels ~54% are brighter than the clean bg, only ~4% darker -- so the white-logo model `I=(1-α)O+α·255` is correct; (2) but content backgrounds are almost never dark *under* the mark (median darkest available bg over glyph pixels = **58/255**; only ~13% of mark pixels are ever observed on a bg < 40), so on bright backgrounds the equation is ill-conditioned and `α` is unidentifiable; (3) LaMa's `O` is a plausible **hallucination**, not the true pre-mark background, which compounds the error, and per-pixel regression on ~15 obs overfits into colour noise.
 
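
The ill-conditioning described in points (2) and (3) can be illustrated with a small sketch. It assumes only the white-logo model quoted in the note, `I=(1-α)O+α·255`, solved per pixel as `α = (I - O) / (255 - O)`. The function names (`composite`, `recover_alpha`, `alpha_error`) and the sample values are hypothetical and are not part of the repository or its analysis scripts; the 10-level background error stands in for an imperfect inpainted `O`.

```python
# Illustrative sketch only: hypothetical helpers, not code from the repository.
# Model taken from the note: I = (1 - alpha) * O + alpha * 255 (white logo).

LOGO = 255.0  # white logo colour assumed by the model


def composite(alpha: float, bg: float, logo: float = LOGO) -> float:
    """Forward model: blend a logo of the given colour over background bg."""
    return (1.0 - alpha) * bg + alpha * logo


def recover_alpha(observed: float, bg_estimate: float, logo: float = LOGO) -> float:
    """Invert the model for alpha, given an estimate of the background."""
    denom = logo - bg_estimate
    if abs(denom) < 1e-9:
        # Background equals the logo colour: alpha cannot be determined at all.
        raise ValueError("alpha is unidentifiable when bg equals the logo colour")
    return (observed - bg_estimate) / denom


def alpha_error(true_alpha: float, true_bg: float, bg_error: float) -> float:
    """Error in recovered alpha when the background estimate is off by bg_error."""
    observed = composite(true_alpha, true_bg)
    return recover_alpha(observed, true_bg + bg_error) - true_alpha


if __name__ == "__main__":
    # Same alpha, same background error, increasingly bright backgrounds.
    for bg in (20.0, 58.0, 150.0, 230.0):
        err = alpha_error(true_alpha=0.5, true_bg=bg, bg_error=10.0)
        print(f"bg={bg:5.0f}  alpha error={err:+.3f}")
```

The same background error produces a much larger `α` error as `255 - O` shrinks, which is consistent with the note's observation that dark backgrounds under the mark are what the estimate needs and what the content images rarely provide.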