fix(photomaker-v2): render at SDXL native 1024, use upstream prompt + neg_prompt

The 9-face grid + single-face cert outputs were still mosaic of training-time
faces even after the id_embeds shape fix. WebFetch of the upstream
inference_pmv2.py revealed three mismatches:

1. SDXL at width=height=512 falls into its low-res failure mode (small-detail
   collage / mosaic) on the V2 LoRA. Render at native 1024 then downscale into
   the original face bbox at composite time.
2. Upstream prompt is descriptive ("instagram photo, portrait photo of a woman
   img, colorful, perfect face, natural skin, hard shadows, film grain, best
   quality"). Our generic prompt let SDXL drift away from the ID embedding.
   Adopted the upstream pattern.
3. Upstream V2 explicitly passes negative_prompt; the CFG batch-mismatch we hit
   on V1 isn't a V2 issue. Re-added negative_prompt with the upstream wording
   (asymmetry/worst quality/etc).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Victor Kuznetsov
2026-06-08 19:11:48 -07:00
parent b1fed810fd
commit c486badaa8
+21 -12
View File
@@ -76,12 +76,25 @@ _SDXL_MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
# `img`, which PhotoMaker replaces with the ID embedding at inference. Keeping it
# scene-neutral (no extra style words) maximises identity transfer from the embed and
# minimises hallucinated background/lighting that would not match the cleaned scene.
_PHOTOMAKER_PROMPT = "a portrait photo of a person img, natural lighting, sharp focus"
_PHOTOMAKER_NEGATIVE = "blurry, lowres, deformed, distorted, watermark"
# Prompt format follows the upstream V2 reference (inference_pmv2.py): the trigger
# word ``img`` must immediately follow a class noun. SDXL is happiest at 1024 and
# falls into low-res artefacts ("mosaic of tiny faces") at 512, so we render at
# 1024 then downscale into the face bbox at composite time. Caught visually
# 2026-06-04: at 512 V2 produced a collage of training-time faces; at 1024 with the
# upstream-style descriptive prompt it produces a clean face.
_PHOTOMAKER_PROMPT = (
"instagram photo, portrait photo of a person img, natural skin, soft lighting, "
"best quality, sharp focus"
)
_PHOTOMAKER_NEGATIVE = (
"(asymmetry, worst quality, low quality, illustration, 3d, 2d, painting, "
"cartoons, sketch), open mouth, blurry, watermark"
)
# Square size used to feed PhotoMaker (must match a multiple of 64; 512 fits CPU/GPU
# comfortably and gives the encoder enough pixels for a stable embedding).
_PHOTOMAKER_FACE_SIZE = 512
# SDXL native resolution; lower values send V2 into low-res mode and the output
# becomes a collage of training-time faces. We render at 1024 then downscale into
# the original face bbox at composite time.
_PHOTOMAKER_FACE_SIZE = 1024
_pipeline: Any | None = None
_pipeline_lock = threading.Lock()
@@ -334,15 +347,11 @@ def restore_faces_photomaker(
id_crop_rgb = cv2.cvtColor(id_crop_bgr, cv2.COLOR_BGR2RGB)
id_image_pil = Image.fromarray(id_crop_rgb)
# Don't pass negative_prompt: the PhotoMaker pipeline manages its own CFG by
# concatenating [negative_prompt_embeds, prompt_embeds]; if we pass a custom
# negative the upstream code splits text_only vs id-injected branches and
# the resulting embed batch dims can mismatch (we saw
# "Sizes of tensors must match except in dimension 1. Expected size 2 but got
# size 1" on a real run). The default empty negative is what the upstream
# gradio demo uses.
# Upstream V2 reference (inference_pmv2.py) passes negative_prompt; the
# batch-mismatch we hit earlier was on V1 only.
out = pipeline(
prompt=_PHOTOMAKER_PROMPT,
negative_prompt=_PHOTOMAKER_NEGATIVE,
input_id_images=[id_image_pil],
id_embeds=id_embeds,
num_inference_steps=num_inference_steps,