ORT's CoreML EP GatherOpBuilder::IsOpSupportedImpl explicitly rejects
rank-0 (scalar) index tensors. StyleGAN-derived models (GFPGAN's 1024
variant has 16 of them, one per style-code slice) hit this in the
generator, and the resulting CPU fallbacks split the CoreML subgraph
into multiple partitions with boundary crossings on every inference.
Add a load-time ONNX rewrite that promotes each scalar index to [1] and
squeezes the added axis on the Gather output — semantically identical
but CoreML-compatible. GFPGAN now runs as a single CoreML partition with
zero CPU-fallback nodes; inference drops from ~87 ms to ~81 ms on an
M-series Mac.
The fix has been filed upstream as microsoft/onnxruntime#28180. The
existing code comment in gather_op_builder.cc already describes this
exact workaround; it just isn't applied. Once the upstream fix ships
and the ORT floor is raised, this pass can be deleted.
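The equivalence the rewrite relies on (promote the scalar index to [1], then squeeze the added axis) can be checked outside ONNX. The following is a minimal NumPy sketch, not the load-time pass itself: `np.take` stands in for Gather, and the tensor shape and index value are made up for illustration.

```python
import numpy as np

def gather(data, indices, axis):
    """NumPy stand-in for ONNX Gather (np.take matches it for this case)."""
    return np.take(data, indices, axis=axis)

# Hypothetical stack of style codes: (batch, num_styles, dim).
styles = np.arange(2 * 16 * 4, dtype=np.float32).reshape(2, 16, 4)

# Original graph: a rank-0 (scalar) index drops the gathered axis.
scalar_result = gather(styles, np.array(5), axis=1)      # shape (2, 4)

# Rewritten graph: index promoted to shape [1], then the kept axis is squeezed.
promoted = gather(styles, np.array([5]), axis=1)         # shape (2, 1, 4)
rewritten_result = np.squeeze(promoted, axis=1)          # shape (2, 4)
```

Both paths select the same slice, so the only change visible to CoreML is the rank of the index tensor.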
Addresses Sourcery feedback on PR #1776: _get_soft_alpha caches a single
NxN template keyed by N, which is correct for the inswapper model
(128x128 aligned-face space) but would silently mis-warp if a caller
ever passed a non-square aligned face. Assert the shape instead of
silently assuming it.
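A rough sketch of the guard, assuming a template cache keyed only by N. The helper names and the feather formula below are illustrative, not the code in face_swapper.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def _soft_alpha_template(size):
    """Hypothetical stand-in: build an NxN feather mask once per size N."""
    border = max(1, size // 8)

    def ramp(i):
        # 0 at the edge, 255 once `border` pixels inside.
        return min(255, 255 * min(i, size - 1 - i) // border)

    return tuple(
        tuple(min(ramp(y), ramp(x)) for x in range(size)) for y in range(size)
    )

def get_soft_alpha(aligned_shape):
    """Return the cached template, refusing non-square aligned faces."""
    height, width = aligned_shape[:2]
    assert height == width, f"aligned face must be square, got {height}x{width}"
    return _soft_alpha_template(height)
```

Because the cache key is a single integer, the assertion is the only place a non-square input can be caught before the wrong template is warped.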
The processing thread was running cvtColor on the full-resolution 1920×1080
frame before queueing it for display. Since the display thread immediately
resizes the frame to the preview window (~5× smaller pixel count), doing
the colour conversion on the resized buffer is cheaper overall.
Processing thread now queues BGR; display thread resizes then cvtColor.
Both face_swapper._fast_paste_back and face_enhancer._paste_back were
doing a numpy float32 round-trip per frame: convert the target crop and
the warped face to float32, blend, clip, cast back to uint8. That's four
crop-sized allocations plus unvectorized elementwise math.
Replace with a fused uint8 blend using cv2.merge + cv2.multiply + cv2.add,
which cv2 dispatches to SIMD (NEON on Apple Silicon / AVX on x86). Stored
alpha templates switched from float32 [0, 1] to uint8 [0, 255] so no
conversion is needed per frame. CUDA paths also simplified — upload uint8
alpha (less bandwidth) and scale on device.
Micro-bench on 1000x1000 RGB crop:
  current (float32 numpy): 9.43 ms
  cv2 uint8 fused:         1.16 ms (8.1× faster, max diff 2/255)
Visual diff is imperceptible (quantization noise in the last step).
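To illustrate why a uint8 alpha in [0, 255] removes the float round-trip, here is a NumPy-only sketch of the two blend shapes. It is a stand-in for the idea, not the cv2.merge + cv2.multiply + cv2.add path in this commit; the rounding rule and array sizes are assumptions.

```python
import numpy as np

def blend_float32(face, target, alpha_f):
    """Old shape of the code: float32 round-trip, alpha in [0, 1]."""
    a = alpha_f[..., None]
    out = face.astype(np.float32) * a + target.astype(np.float32) * (1.0 - a)
    return np.clip(out, 0, 255).astype(np.uint8)

def blend_uint8(face, target, alpha_u8):
    """Integer blend, alpha in [0, 255]; uint16 holds every intermediate."""
    a = alpha_u8[..., None].astype(np.uint16)
    acc = face.astype(np.uint16) * a + target.astype(np.uint16) * (255 - a)
    return ((acc + 127) // 255).astype(np.uint8)   # rounded divide by 255

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
target = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
alpha_u8 = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

reference = blend_float32(face, target, alpha_u8.astype(np.float32) / 255.0)
fused = blend_uint8(face, target, alpha_u8)
max_diff = int(np.abs(reference.astype(np.int16) - fused.astype(np.int16)).max())
```

The largest intermediate is 255 * 255 + 127, which fits in uint16, so no wider type is needed for the accumulator.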
_fast_paste_back used to erode and Gaussian-blur the warped alpha mask in
output coordinates with kernel sizes proportional to the on-screen face
bbox. That made the per-frame cost ~O(area * k^2) — a face filling half
the frame took ~8x the compositing work of one filling a quarter, which
is why FPS fell off when leaning into the camera.
Instead, build a feathered alpha template once at aligned-face resolution
(128x128 for inswapper) and warp the soft mask per-frame. The affine
transform preserves the relative feather width, so the visual output is
equivalent; the per-frame cost is now O(crop_area) with no size-scaled
erode/blur and no size-scaled padding.
Also collapses the CPU fallback onto the same shape — it previously did
a full-frame warpAffine twice per call, which scaled with the whole
frame instead of the face crop.
- core._run_pipe_pipeline: hand the background detector its own copy of
the frame. The frame processors mutate in place via paste-back, which
was racing with concurrent face detection on the same buffer.
- face_swapper._init_cuda_graph_session: replace the
`swapper.session.run` monkey-patch with a `_CudaGraphSessionAdapter`
that proxies every attribute to the underlying session and only
overrides `.run()`. Guarded so repeat init does not double-wrap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues surfaced in post-squash review of f65aeae:
1. CUDA-graph replay buffers were shared across threads with no lock.
`_cuda_graph_swap_inference` mutates module-level ort_input/ort_latent
and runs run_with_iobinding — concurrent swap calls on Windows/CUDA
could overwrite each other's bound input buffers before replay,
producing wrong-face output. Added `_cuda_graph_lock` around the
full update/run/read sequence.
2. Face enhancer loop unconditionally broke after the first face, so
`many_faces=True` silently enhanced only one face. Also, the
single-slot temporal cache would paste the same enhancement onto
every target if reused in many-faces mode. Gated the break on
`not many_faces_mode` and disabled the cache path in that mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles CoreML graph rewrites, GPU-accelerated pipeline work, Windows CUDA
fixes, and Mac/Windows runtime routing into a single drop.
CoreML (Apple Silicon):
- Decompose Pad(reflect) → Slice+Concat in inswapper_128 so the model
runs in one CoreML partition instead of 14 (TEMPORARY: fixed upstream
in microsoft/onnxruntime#28073, drop when ORT >= 1.26.0).
- Fold Shape/Gather chains to constants in det_10g (21ms → 4ms).
- Decompose Split(axis=1) → Slice pairs in GFPGAN (155ms → 89ms).
- Route detection model to GPU so the ANE is free for the swap model.
- Centralize provider/config selection in create_onnx_session.
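The Pad(reflect) decomposition in the first bullet rests on reflect padding being expressible as reversed slices concatenated around the input. A small NumPy check of that identity, assuming an NCHW tensor padded on H and W; the real pass rewrites ONNX nodes and is not shown here.

```python
import numpy as np

def reflect_pad_last_axis(x, pad):
    """Reflect-pad the last axis using only slicing and concatenation."""
    # Mirror without repeating the edge sample, as reflect mode does.
    left = x[..., pad:0:-1]
    right = x[..., -2:-pad - 2:-1]
    return np.concatenate([left, x, right], axis=-1)

def reflect_pad_hw(x, pad):
    """Apply the same decomposition to the last two axes (H and W)."""
    padded_w = reflect_pad_last_axis(x, pad)
    padded_h = reflect_pad_last_axis(np.swapaxes(padded_w, -1, -2), pad)
    return np.swapaxes(padded_h, -1, -2)

x = np.arange(1 * 2 * 5 * 6, dtype=np.float32).reshape(1, 2, 5, 6)
expected = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)), mode="reflect")
decomposed = reflect_pad_hw(x, 1)
```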
Pipeline (all platforms):
- Parallelize face landmark + recognition post-detection; skip landmark_2d_106
when only face_swapper is active.
- Pipeline face detection with swap for ANE overlap.
- GPU-accelerated paste_back, MJPEG capture, zero-copy display path.
- Standalone pipeline benchmark script.
Windows / CUDA:
- CUDA graphs + FP16 model + all-GPU pipeline for 1080p 60 FPS.
- Auto-detect GPU provider and fix DLL discovery for Windows CUDA execution.
Cross-platform:
- platform_info helper for Mac/Windows runtime routing.
- GFPGAN 30 fps + MSMF camera 60 fps with adaptive pipeline tuning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Sourcery review feedback: move face_align and get_one_face
imports from inside per-frame functions to module level to avoid
repeating the import lookup on every frame in the processing loop.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix CoreML execution provider falling back to CPU silently, eliminate
redundant per-frame face detection, and optimize the paste-back blend
to operate on the face bounding box instead of the full frame.
All changes are quality-neutral (pixel-identical output verified) and
benefit non-Mac platforms via the shared detection and paste-back
improvements.
Changes:
- Remove unsupported CoreML options (RequireStaticShapes, MaximumCacheSize)
that caused ORT 1.24 to silently fall back to CPUExecutionProvider
- Add _fast_paste_back(): bbox-restricted erode/blur/blend, skip dead
fake_diff code in insightface's inswapper (computed but never used)
- process_frame() accepts optional pre-detected target_face to avoid
redundant get_one_face() call (~30-40ms saved per frame, all platforms)
- In-memory pipeline detects face once and shares across processors
- Fix get_face_swapper() to fall back to FP16 model when FP32 absent
- Fix pre_start() to accept either model variant (was FP16-only check)
- Make tensorflow import conditional (fixes crash on macOS)
- Add missing tqdm dep, make tensorflow/pygrabber platform-conditional
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Some people just want the OpenCV error gone. I keep telling them that OpenCV is only used for blurs and color conversion; it is ONNX Runtime that runs the swap.
The live webcam preview in ui.py calls process_frame_v2() on all
frame processors, but face_enhancer.py was missing this method.
This caused an AttributeError crash when the GFPGAN face enhancer
was enabled during live mode.
Fixes https://github.com/hacksider/Deep-Live-Cam/issues/1654
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change default face swapper model to FP32 for better GPU compatibility and to avoid NaN issues on certain GPUs.
Revamp `run.py` to adjust PATH variables for dependency setup, and re-add it with expanded configuration.
Add ToolTip class (modules/ui_tooltip.py) and wire descriptive hover
tooltips onto every button, switch, slider, and dropdown in the main
window. Tooltips appear after a 500ms hover delay and are clamped to
screen bounds.
This requires no new dependencies — ToolTip uses only customtkinter.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cv2_enumerate_cameras(CAP_AVFOUNDATION) probes indices 0-99 through
OpenCV's AVFoundation backend, which intermittently segfaults (exit
code 139) when invalid device indices are probed. Replace with a
bounded cv2.VideoCapture loop (range(10)) that safely skips
unavailable indices.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
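A sketch of the bounded probe, written against an injected factory so it can be exercised without a camera or OpenCV installed. `probe_cameras` and `FakeCapture` are illustrative names; the only OpenCV surface assumed is that the capture object exposes `isOpened()` and `release()`.

```python
def probe_cameras(open_capture, max_index=10):
    """Return indices in range(max_index) whose capture opens successfully."""
    available = []
    for index in range(max_index):
        capture = open_capture(index)
        try:
            if capture.isOpened():
                available.append(index)
        finally:
            capture.release()   # always free the device handle
    return available


class FakeCapture:  # hypothetical stand-in for cv2.VideoCapture in a dry run
    def __init__(self, index):
        self._opened = index in (0, 2)
        self.released = False

    def isOpened(self):
        return self._opened

    def release(self):
        self.released = True


found = probe_cameras(FakeCapture)
```

On a real machine the call would presumably be `probe_cameras(cv2.VideoCapture)`, which keeps the probe to ten indices at most.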
- Replace float64 with float32 in apply_mouth_area() blending masks —
float32 provides sufficient precision for 8-bit image blending and
halves memory bandwidth
- Use float32 in apply_mask_area() mask computations
- Vectorize hull padding loop in create_face_mask() (face_masking.py)
replacing per-point Python loop with NumPy array operations
- Fix apply_color_transfer() to use proper [0,1] LAB conversion —
cv2.cvtColor with float32 input expects [0,1] range, not [0,255]
- Pre-compute inverse masks to avoid repeated (1.0 - mask) subtraction
- Use np.broadcast_to instead of np.repeat for face mask expansion
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
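The np.broadcast_to and pre-computed inverse mask changes can be illustrated in isolation. The mask and frame sizes below are arbitrary, and the blend is a simplified stand-in for the masking code.

```python
import numpy as np

mask = np.linspace(0.0, 1.0, 4 * 5, dtype=np.float32).reshape(4, 5)

# np.repeat materializes a new (H, W, 3) array.
repeated = np.repeat(mask[:, :, np.newaxis], 3, axis=2)

# np.broadcast_to returns a read-only view over the original buffer.
broadcast = np.broadcast_to(mask[:, :, np.newaxis], (4, 5, 3))

# Compute the inverse once instead of writing (1.0 - mask) at each use.
inverse = 1.0 - broadcast

face = np.full((4, 5, 3), 200.0, dtype=np.float32)
frame = np.full((4, 5, 3), 100.0, dtype=np.float32)
blended = face * broadcast + frame * inverse
```

Since the broadcast view is read-only, any code that writes into the expanded mask in place would still need an explicit copy.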
Move opacity calculation before frame copy to skip the copy when
opacity is 1.0 (common case). Add early return path for full opacity.
Clear PREVIOUS_FRAME_RESULT instead of caching when interpolation
is disabled.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a dedicated detection thread that runs face detection continuously
on the latest captured frame and publishes results to a shared dict.
The processing/swap thread reads cached detection results instead of
running detection inline, so it never blocks on the 15-30ms detection
cost.
Architecture change: 2 threads → 3 threads
  Before: capture → [detect + swap] → display
  After:  capture → swap (uses cached detections) → display
                  ↘ detect (async, writes to shared cache) ↗
Also replaces the blocking while/ROOT.update() display loop with
ROOT.after()-based scheduling, which avoids Tk event loop re-entrancy
issues and UI freezes.
Closes #1664
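The three-thread layout can be sketched with the standard library alone. The commit publishes results to a shared dict; this sketch wraps the same idea in a small hypothetical `DetectionCache` class, and `fake_detect` stands in for the real detector.

```python
import threading
import time

class DetectionCache:
    """Latest-frame slot plus latest-result slot, guarded by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None
        self._faces = None

    def submit_frame(self, frame):
        with self._lock:
            self._frame = frame

    def latest_frame(self):
        with self._lock:
            return self._frame

    def publish(self, faces):
        with self._lock:
            self._faces = faces

    def cached_faces(self):
        with self._lock:
            return self._faces


def detection_loop(cache, detect, stop_event, idle_sleep=0.001):
    """Run detect() on the newest frame until asked to stop."""
    last_seen = None
    while not stop_event.is_set():
        frame = cache.latest_frame()
        if frame is None or frame is last_seen:
            time.sleep(idle_sleep)   # nothing new; yield briefly
            continue
        cache.publish(detect(frame))
        last_seen = frame


def fake_detect(frame):  # hypothetical detector: one "face" per frame id
    return [f"face@{frame}"]


cache = DetectionCache()
stop = threading.Event()
thread = threading.Thread(
    target=detection_loop, args=(cache, fake_detect, stop), daemon=True
)
thread.start()

cache.submit_frame("frame-1")            # capture thread hands over a frame
deadline = time.monotonic() + 2.0
while cache.cached_faces() is None and time.monotonic() < deadline:
    time.sleep(0.001)
faces_for_swap = cache.cached_faces()    # swap thread reads, never detects

stop.set()
thread.join(timeout=2.0)
```

The swap side only ever calls `cached_faces()`, so its latency is independent of how long a detection pass takes; the trade-off is that results may lag the current frame by one detection interval.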
In Tk 9.0, Menu.index("end") returns "" instead of raising TclError
on empty menus. CustomTkinter's DropdownMenu._add_menu_commands
doesn't handle this case, causing a crash when creating CTkOptionMenu
widgets (e.g., the camera selector dropdown).
Add a monkey-patch that guards against the empty-string return value.
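The guard amounts to treating an empty-string index as "no entries". A minimal sketch of that check, using a hypothetical `FakeMenu` in place of a real Tk menu; it does not reproduce CustomTkinter's `_add_menu_commands` or the actual patch.

```python
def safe_end_index(menu):
    """Return the last menu index, or None when the menu is empty."""
    end = menu.index("end")
    # Tk 9.0 can report an empty menu as ""; also tolerate None.
    if end == "" or end is None:
        return None
    return int(end)


def delete_all_entries(menu):
    """Clear a menu without assuming index("end") is a usable integer."""
    end = safe_end_index(menu)
    if end is not None:
        menu.delete(0, end)


class FakeMenu:  # hypothetical stand-in for a Tk menu widget
    def __init__(self, entries, empty_marker=""):
        self.entries = list(entries)
        self._empty_marker = empty_marker

    def index(self, _which):
        return len(self.entries) - 1 if self.entries else self._empty_marker

    def delete(self, first, last):
        del self.entries[first:last + 1]
```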