ORT's CoreML EP GatherOpBuilder::IsOpSupportedImpl explicitly rejects
rank-0 (scalar) index tensors. StyleGAN-derived models (GFPGAN's 1024
variant has 16 of them, one per style-code slice) hit this in the
generator, and the resulting CPU fallbacks split the CoreML subgraph
into multiple partitions with boundary crossings on every inference.
Add a load-time ONNX rewrite that promotes each scalar index to [1] and
squeezes the added axis on the Gather output — semantically identical
but CoreML-compatible. GFPGAN now runs as a single CoreML partition with
zero CPU-fallback nodes; inference drops from ~87 ms to ~81 ms on an
M-series Mac.
The fix has been filed upstream as microsoft/onnxruntime#28180 — the
existing code comment in gather_op_builder.cc already describes this
exact workaround; it just isn't applied. Once the upstream fix ships
and the ORT floor is raised, this pass can be deleted.
Addresses Sourcery feedback on PR #1776: _get_soft_alpha caches a single
NxN template keyed by N, which is correct for the inswapper model
(128x128 aligned-face space) but would mis-warp without warning if a
caller ever passed a non-square aligned face. Assert squareness instead
of silently assuming it.
The processing thread was running cvtColor on the full-resolution 1920×1080
frame before queueing it for display. Since the display thread immediately
resizes the frame to the preview window (~5× smaller pixel count), doing
the colour conversion on the resized buffer is cheaper overall.
Processing thread now queues BGR; display thread resizes then cvtColor.
Both face_swapper._fast_paste_back and face_enhancer._paste_back were
doing a numpy float32 round-trip per frame: convert the target crop and
the warped face to float32, blend, clip, cast back to uint8. That's four
crop-sized allocations plus unvectorized elementwise math.
Replace with a fused uint8 blend using cv2.merge + cv2.multiply + cv2.add,
which cv2 dispatches to SIMD (NEON on Apple Silicon / AVX on x86). Stored
alpha templates switched from float32 [0, 1] to uint8 [0, 255] so no
conversion is needed per frame. CUDA paths also simplified — upload uint8
alpha (less bandwidth) and scale on device.
Micro-bench on a 1000x1000 RGB crop:
- current (float32 numpy): 9.43 ms
- cv2 uint8 fused: 1.16 ms (8.1× faster, max diff 2/255)
Visual diff is imperceptible (quantization noise in the last step).
_fast_paste_back used to erode and Gaussian-blur the warped alpha mask in
output coordinates with kernel sizes proportional to the on-screen face
bbox. That made the per-frame cost ~O(area * k^2) — a face filling half
the frame took ~8x the compositing work of one filling a quarter, which
is why FPS fell off when leaning into the camera.
Instead, build a feathered alpha template once at aligned-face resolution
(128x128 for inswapper) and warp the soft mask per-frame. The affine
transform preserves the relative feather width, so the visual output is
equivalent; the per-frame cost is now O(crop_area) with no size-scaled
erode/blur and no size-scaled padding.
Also collapses the CPU fallback onto the same shape — it previously did
a full-frame warpAffine twice per call, which scaled with the whole
frame instead of the face crop.
- core._run_pipe_pipeline: hand the background detector its own copy of
the frame. The frame processors mutate in place via paste-back, which
was racing with concurrent face detection on the same buffer.
- face_swapper._init_cuda_graph_session: replace the
`swapper.session.run` monkey-patch with a `_CudaGraphSessionAdapter`
that proxies every attribute to the underlying session and only
overrides `.run()`. Guarded so repeat init does not double-wrap.
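The adapter pattern can be sketched like this (only the class name comes from the commit; the replay callable here is a stand-in for the real CUDA-graph path):

```python
class _CudaGraphSessionAdapter:
    """Proxy that forwards every attribute to the wrapped session and
    overrides only .run()."""

    def __init__(self, session, replay_fn):
        self._session = session
        self._replay_fn = replay_fn

    def __getattr__(self, name):
        # Invoked only for names NOT defined on the adapter itself, so
        # get_inputs(), get_providers(), io_binding() etc. pass through.
        return getattr(self._session, name)

    def run(self, output_names, input_feed, run_options=None):
        return self._replay_fn(output_names, input_feed)

def wrap_session(session, replay_fn):
    """Guard so repeat init does not double-wrap the session."""
    if isinstance(session, _CudaGraphSessionAdapter):
        return session
    return _CudaGraphSessionAdapter(session, replay_fn)
```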
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PERFORMANCE.md documents measured gains on MacBook Pro M3 Max vs
hacksider/Deep-Live-Cam main@64d3f06:
- Face swap only: <5 FPS -> >20 FPS
- Face swap + GFPGAN: <2 FPS -> >10 FPS
- Camera: 640x480 -> 960x540 MJPEG @ 60fps
Breaks down the contributors (camera negotiation, CoreML graph
rewrites with before/after op latencies, pipeline overlap, GFPGAN
temporal cache, paste-back optimization, platform routing, Windows
CUDA path) and how to reproduce.
REVIEW_TODOS.md captures 12 findings from two independent reviews
(Claude in-tree + Codex second opinion) grouped as Blockers /
Should-fix / Consider, each with file:line and suggested fix. The
two Blocker/Should-fix items are addressed in the preceding commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues surfaced in post-squash review of f65aeae:
1. CUDA-graph replay buffers were shared across threads with no lock.
`_cuda_graph_swap_inference` mutates module-level ort_input/ort_latent
and runs run_with_iobinding — concurrent swap calls on Windows/CUDA
could overwrite each other's bound input buffers before replay,
producing wrong-face output. Added `_cuda_graph_lock` around the
full update/run/read sequence.
2. Face enhancer loop unconditionally broke after the first face, so
`many_faces=True` silently enhanced only one face. Also, the
single-slot temporal cache would paste the same enhancement onto
every target if reused in many-faces mode. Gated the break on
`not many_faces_mode` and disabled the cache path in that mode.
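The locking shape of fix (1), sketched with numpy buffers standing in for the bound ORT tensors (buffer names and the replay stand-in are illustrative, not real ORT calls):

```python
import threading
import numpy as np

# Module-level replay buffers, mirroring the commit's ort_input/ort_latent.
_ort_input = np.zeros(4, np.float32)
_ort_output = np.zeros(4, np.float32)
_cuda_graph_lock = threading.Lock()

def _replay() -> None:
    # Stand-in for session.run_with_iobinding(): reads the bound input
    # buffer and writes the bound output buffer.
    np.copyto(_ort_output, _ort_input * 2.0)

def swap_inference(frame: np.ndarray) -> np.ndarray:
    # The lock must span the FULL update -> replay -> read sequence;
    # locking only the buffer update would still let another thread
    # clobber _ort_input between our update and the replay.
    with _cuda_graph_lock:
        np.copyto(_ort_input, frame)
        _replay()
        return _ort_output.copy()
```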
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles CoreML graph rewrites, GPU-accelerated pipeline work, Windows CUDA
fixes, and Mac/Windows runtime routing into a single drop.
CoreML (Apple Silicon):
- Decompose Pad(reflect) → Slice+Concat in inswapper_128 so the model
runs in one CoreML partition instead of 14 (TEMPORARY: fixed upstream
in microsoft/onnxruntime#28073, drop when ORT >= 1.26.0).
- Fold Shape/Gather chains to constants in det_10g (21ms → 4ms).
- Decompose Split(axis=1) → Slice pairs in GFPGAN (155ms → 89ms).
- Route detection model to GPU so the ANE is free for the swap model.
- Centralize provider/config selection in create_onnx_session.
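The Pad(reflect) decomposition in the first bullet amounts to mirrored slices concatenated around the tensor; a numpy sketch of the equivalence (in the ONNX graph the mirrored slices become Slice ops with negative steps feeding a Concat):

```python
import numpy as np

def reflect_pad_via_slice_concat(x: np.ndarray, p: int) -> np.ndarray:
    """Reflect-pad the last two axes by p using only slices and concats,
    which the CoreML EP can keep inside one partition."""
    # Height axis: rows 1..p mirrored on top, rows -p-1..-2 on the bottom
    # (reflect padding excludes the edge row itself).
    x = np.concatenate([x[..., p:0:-1, :], x, x[..., -2:-p - 2:-1, :]],
                       axis=-2)
    # Same trick on the width axis.
    x = np.concatenate([x[..., p:0:-1], x, x[..., -2:-p - 2:-1]], axis=-1)
    return x
```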
Pipeline (all platforms):
- Parallelize face landmark + recognition post-detection; skip landmark_2d_106
when only face_swapper is active.
- Pipeline face detection with swap for ANE overlap.
- GPU-accelerated paste_back, MJPEG capture, zero-copy display path.
- Standalone pipeline benchmark script.
Windows / CUDA:
- CUDA graphs + FP16 model + all-GPU pipeline for 1080p 60 FPS.
- Auto-detect GPU provider and fix DLL discovery for Windows CUDA execution.
Cross-platform:
- platform_info helper for Mac/Windows runtime routing.
- GFPGAN 30 fps + MSMF camera 60 fps with adaptive pipeline tuning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Sourcery review feedback: move face_align and get_one_face
imports from inside per-frame functions to module level, so the import
statement (a sys.modules lookup and name rebind) is not re-executed on
every frame in the processing loop.
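The difference, sketched with a stdlib function standing in for get_one_face / face_align:

```python
def detect_with_inner_import(frame):
    # Anti-pattern: the import statement re-executes (sys.modules lookup
    # plus a local rebind) on every call in the hot loop.
    from math import hypot  # stands in for get_one_face
    return hypot(*frame)

from math import hypot  # hoisted: resolved once at module load

def detect_hoisted(frame):
    return hypot(*frame)
```

Both return the same result; the hoisted version just skips the per-call import machinery.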
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix CoreML execution provider falling back to CPU silently, eliminate
redundant per-frame face detection, and optimize the paste-back blend
to operate on the face bounding box instead of the full frame.
All changes are quality-neutral (pixel-identical output verified) and
benefit non-Mac platforms via the shared detection and paste-back
improvements.
Changes:
- Remove unsupported CoreML options (RequireStaticShapes, MaximumCacheSize)
that caused ORT 1.24 to silently fall back to CPUExecutionProvider
- Add _fast_paste_back(): bbox-restricted erode/blur/blend, skip dead
fake_diff code in insightface's inswapper (computed but never used)
- process_frame() accepts optional pre-detected target_face to avoid
redundant get_one_face() call (~30-40ms saved per frame, all platforms)
- In-memory pipeline detects face once and shares across processors
- Fix get_face_swapper() to fall back to FP16 model when FP32 absent
- Fix pre_start() to accept either model variant (was FP16-only check)
- Make tensorflow import conditional (fixes crash on macOS)
- Add missing tqdm dep, make tensorflow/pygrabber platform-conditional
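The FP32 -> FP16 model fallback from the bullets above can be sketched as (helper name and directory layout are assumptions; the real logic lives in get_face_swapper):

```python
import os

def resolve_swapper_model(models_dir: str) -> str:
    """Prefer the FP32 inswapper model; fall back to the FP16 variant
    when the FP32 file is absent."""
    for name in ("inswapper_128.onnx", "inswapper_128_fp16.onnx"):
        path = os.path.join(models_dir, name)
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"no inswapper model found in {models_dir}")
```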
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Some people just want the OpenCV error gone. I keep telling them that
OpenCV is only used for blurs and colour conversion; it is ONNX Runtime
that runs the swap.
The macOS Apple Silicon section installed Python 3.11 but then
referenced Python 3.10 in several places:
- `brew install python-tk@3.10` → python-tk@3.11
- Linux comment "Ensure you use the installed Python 3.10" → 3.11
- CoreML section cross-reference "completed the macOS setup above
using Python 3.10" → 3.11
- `python3.10 run.py` usage command → python3.11
- "You must use Python 3.10" note → 3.11
- `brew reinstall python-tk@3.10` troubleshooting tip → 3.11
- Removed `python@3.11` from the list of conflicting versions to
uninstall (it is the required version, not a conflict)
Fixes #1632
The live webcam preview in ui.py calls process_frame_v2() on all
frame processors, but face_enhancer.py was missing this method.
This caused an AttributeError crash when the GFPGAN face enhancer
was enabled during live mode.
Fixes https://github.com/hacksider/Deep-Live-Cam/issues/1654
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>