Skip prepending a directory that is already on LD_LIBRARY_PATH, so a
repeated import of run.py does not bloat the variable.
Addresses review feedback on #1826.
Mirrors the Windows preload block from #1775. When onnxruntime-gpu is
installed via pip with nvidia-cudnn-cu12, the .so files sit under
venv/lib/pythonX.Y/site-packages/nvidia/<pkg>/lib/ and the dynamic
linker never sees them. LD_LIBRARY_PATH cannot be set after Python
starts.
Pre-loads every lib*.so* via ctypes.CDLL with RTLD_GLOBAL before
onnxruntime opens its CUDA provider. Also extends LD_LIBRARY_PATH so
child processes (ffmpeg) inherit the path.
Fixes "libcudnn.so.9: cannot open shared object file" on pip-only
Linux installs.
ORT's CoreML EP GatherOpBuilder::IsOpSupportedImpl explicitly rejects
rank-0 (scalar) index tensors. StyleGAN-derived models (GFPGAN's 1024
variant has 16 of them, one per style-code slice) hit this in the
generator, and the resulting CPU fallbacks split the CoreML subgraph
into multiple partitions with boundary crossings on every inference.
Add a load-time ONNX rewrite that promotes each scalar index to [1] and
squeezes the added axis on the Gather output — semantically identical
but CoreML-compatible. GFPGAN now runs as a single CoreML partition with
zero CPU-fallback nodes; inference drops from ~87 ms to ~81 ms on an
M-series Mac.
The fix has been filed upstream as microsoft/onnxruntime#28180 — the
existing code comment in gather_op_builder.cc already describes this
exact workaround, it just isn't applied. Once the upstream fix ships
and the ORT floor is raised, this pass can be deleted.
Addresses Sourcery feedback on PR #1776: _get_soft_alpha caches a single
NxN template keyed by N, which is correct for the inswapper model
(128x128 aligned-face space) but would silently mis-warp if a caller
ever passed a non-square aligned face. Assert the shape instead of
silently assuming it.
The processing thread was running cvtColor on the full-resolution 1920×1080
frame before queueing it for display. Since the display thread immediately
resizes the frame to the preview window (~5× smaller pixel count), doing
the colour conversion on the resized buffer is cheaper overall.
Processing thread now queues BGR; display thread resizes then cvtColor.
Both face_swapper._fast_paste_back and face_enhancer._paste_back were
doing a numpy float32 round-trip per frame: convert the target crop and
the warped face to float32, blend, clip, cast back to uint8. That's four
crop-sized allocations plus unvectorized elementwise math.
Replace with a fused uint8 blend using cv2.merge + cv2.multiply + cv2.add,
which cv2 dispatches to SIMD (NEON on Apple Silicon / AVX on x86). Stored
alpha templates switched from float32 [0, 1] to uint8 [0, 255] so no
conversion is needed per frame. CUDA paths also simplified — upload uint8
alpha (less bandwidth) and scale on device.
Micro-bench on 1000x1000 RGB crop:
current (float32 numpy): 9.43 ms
cv2 uint8 fused: 1.16 ms (8.1× faster, max diff 2/255)
Visual diff is imperceptible (quantization noise in the last step).
_fast_paste_back used to erode and Gaussian-blur the warped alpha mask in
output coordinates with kernel sizes proportional to the on-screen face
bbox. That made the per-frame cost ~O(area * k^2) — a face filling half
the frame took ~8x the compositing work of one filling a quarter, which
is why FPS fell off when leaning into the camera.
Instead, build a feathered alpha template once at aligned-face resolution
(128x128 for inswapper) and warp the soft mask per-frame. The affine
transform preserves the relative feather width, so the visual output is
equivalent; the per-frame cost is now O(crop_area) with no size-scaled
erode/blur and no size-scaled padding.
Also collapses the CPU fallback onto the same shape — it previously did
a full-frame warpAffine twice per call, which scaled with the whole
frame instead of the face crop.
- core._run_pipe_pipeline: hand the background detector its own copy of
the frame. The frame processors mutate in place via paste-back, which
was racing with concurrent face detection on the same buffer.
- face_swapper._init_cuda_graph_session: replace the
`swapper.session.run` monkey-patch with a `_CudaGraphSessionAdapter`
that proxies every attribute to the underlying session and only
overrides `.run()`. Guarded so repeat init does not double-wrap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PERFORMANCE.md documents measured gains on MacBook Pro M3 Max vs
hacksider/Deep-Live-Cam main@64d3f06:
- Face swap only: <5 FPS -> >20 FPS
- Face swap + GFPGAN: <2 FPS -> >10 FPS
- Camera: 640x480 -> 960x540 MJPEG @ 60fps
Breaks down the contributors (camera negotiation, CoreML graph
rewrites with before/after op latencies, pipeline overlap, GFPGAN
temporal cache, paste-back optimization, platform routing, Windows
CUDA path) and how to reproduce.
REVIEW_TODOS.md captures 12 findings from two independent reviews
(Claude in-tree + Codex second opinion) grouped as Blockers /
Should-fix / Consider, each with file:line and suggested fix. The
two Blocker/Should-fix items are addressed in the preceding commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues surfaced in post-squash review of f65aeae:
1. CUDA-graph replay buffers were shared across threads with no lock.
`_cuda_graph_swap_inference` mutates module-level ort_input/ort_latent
and runs run_with_iobinding — concurrent swap calls on Windows/CUDA
could overwrite each other's bound input buffers before replay,
producing wrong-face output. Added `_cuda_graph_lock` around the
full update/run/read sequence.
2. Face enhancer loop unconditionally broke after the first face, so
`many_faces=True` silently enhanced only one face. Also, the
single-slot temporal cache would paste the same enhancement onto
every target if reused in many-faces mode. Gated the break on
`not many_faces_mode` and disabled the cache path in that mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles CoreML graph rewrites, GPU-accelerated pipeline work, Windows CUDA
fixes, and Mac/Windows runtime routing into a single drop.
CoreML (Apple Silicon):
- Decompose Pad(reflect) → Slice+Concat in inswapper_128 so the model
runs in one CoreML partition instead of 14 (TEMPORARY: fixed upstream
in microsoft/onnxruntime#28073, drop when ORT >= 1.26.0).
- Fold Shape/Gather chains to constants in det_10g (21ms → 4ms).
- Decompose Split(axis=1) → Slice pairs in GFPGAN (155ms → 89ms).
- Route detection model to GPU so the ANE is free for the swap model.
- Centralize provider/config selection in create_onnx_session.
Pipeline (all platforms):
- Parallelize face landmark + recognition post-detection; skip landmark_2d_106
when only face_swapper is active.
- Pipeline face detection with swap for ANE overlap.
- GPU-accelerated paste_back, MJPEG capture, zero-copy display path.
- Standalone pipeline benchmark script.
Windows / CUDA:
- CUDA graphs + FP16 model + all-GPU pipeline for 1080p 60 FPS.
- Auto-detect GPU provider and fix DLL discovery for Windows CUDA execution.
Cross-platform:
- platform_info helper for Mac/Windows runtime routing.
- GFPGAN 30 fps + MSMF camera 60 fps with adaptive pipeline tuning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Sourcery review feedback: move face_align and get_one_face
imports from inside per-frame functions to module-level to avoid
repeated attribute lookup overhead in the processing loop.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix CoreML execution provider falling back to CPU silently, eliminate
redundant per-frame face detection, and optimize the paste-back blend
to operate on the face bounding box instead of the full frame.
All changes are quality-neutral (pixel-identical output verified) and
benefit non-Mac platforms via the shared detection and paste-back
improvements.
Changes:
- Remove unsupported CoreML options (RequireStaticShapes, MaximumCacheSize)
that caused ORT 1.24 to silently fall back to CPUExecutionProvider
- Add _fast_paste_back(): bbox-restricted erode/blur/blend, skip dead
fake_diff code in insightface's inswapper (computed but never used)
- process_frame() accepts optional pre-detected target_face to avoid
redundant get_one_face() call (~30-40ms saved per frame, all platforms)
- In-memory pipeline detects face once and shares across processors
- Fix get_face_swapper() to fall back to FP16 model when FP32 absent
- Fix pre_start() to accept either model variant (was FP16-only check)
- Make tensorflow import conditional (fixes crash on macOS)
- Add missing tqdm dep, make tensorflow/pygrabber platform-conditional
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Some people just want the opencv error gone. I keep on telling them that it is only for blurs and color conversion. It is the onnx runtime who is running the swap.