From ea19030c7454f43e4e11489d31ca2b35d7b4ee31 Mon Sep 17 00:00:00 2001
From: Max Buckley
Date: Wed, 22 Apr 2026 11:08:33 +0200
Subject: [PATCH] Add PERFORMANCE.md and REVIEW_TODOS.md

PERFORMANCE.md documents measured gains on MacBook Pro M3 Max vs
hacksider/Deep-Live-Cam main@64d3f06:

- Face swap only: <5 FPS -> >20 FPS
- Face swap + GFPGAN: <2 FPS -> >10 FPS
- Camera: 640x480 -> 960x540 MJPEG @ 60fps

Breaks down the contributors (camera negotiation, CoreML graph rewrites
with before/after op latencies, pipeline overlap, GFPGAN temporal cache,
paste-back optimization, platform routing, Windows CUDA path) and how to
reproduce.

REVIEW_TODOS.md captures 12 findings from two independent reviews
(Claude in-tree + Codex second opinion) grouped as Blockers / Should-fix
/ Consider, each with file:line and suggested fix. The two
Blocker/Should-fix items are addressed in the preceding commit.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 PERFORMANCE.md  | 171 ++++++++++++++++++++++++++++++++++++++++++++++++
 REVIEW_TODOS.md | 147 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 318 insertions(+)
 create mode 100644 PERFORMANCE.md
 create mode 100644 REVIEW_TODOS.md

diff --git a/PERFORMANCE.md b/PERFORMANCE.md
new file mode 100644
index 0000000..b10cdb8
--- /dev/null
+++ b/PERFORMANCE.md
@@ -0,0 +1,171 @@
+# Apple Silicon + Cross-Platform Performance
+
+End-to-end measurements from commit `f65aeae` on a MacBook Pro M3 Max
+against `hacksider/Deep-Live-Cam` upstream `main@64d3f06`, same hardware,
+same camera, same source/target faces.
+
+| Mode | Upstream `main` | This fork | Delta |
+|---|---|---|---|
+| Face swap only | **<5 FPS** | **>20 FPS** | ~5× |
+| Face swap + GFPGAN enhancer | **<2 FPS** | **>10 FPS** | ~5× |
+| Camera resolution | 640×480 default | **960×540 MJPEG** | wider FoV |
+| Camera frame rate | 15–30 fps (backend default) | **60 fps negotiated + measured** | up to 2–4× |
+
+The gap is cumulative — no single change accounts for it. Each section
+below describes one contributor, in rough order of impact on the
+live-video pipeline.
+
+## 1. Camera capture negotiation — `modules/video_capture.py`
+
+Upstream calls `cv2.VideoCapture(device_index)` with no hints and
+accepts whatever the camera defaults to. On most webcams that means
+640×480 YUV at 15–30 fps, and `CAP_PROP_FPS` lies on DirectShow.
+
+This fork:
+
+- Requests `MJPG` fourcc to bypass USB bandwidth limits on uncompressed
+  YUV at high resolutions.
+- Requests 960×540 @ 60 fps up front.
+- Reads back the camera's actual resolution via `CAP_PROP_FRAME_WIDTH/HEIGHT`.
+- **Empirically measures FPS** by timing 30 frames after 10 warmup
+  reads (`_measure_fps`) instead of trusting `CAP_PROP_FPS`. Costs
+  ~0.5–1 s at startup; gives ground-truth numbers that downstream
+  code (detection cadence, enhancer throttle) tunes against.
+- Windows path tries `CAP_MSMF` before `CAP_DSHOW` (DirectShow often
+  caps at 30 fps even when the camera supports 60).
+
+This single change is why the resolution / FoV / FPS look different between
+upstream and the fork before any ML work starts.
+
+## 2. CoreML graph rewrites — `modules/onnx_optimize.py`
+
+CoreML EP silently falls back to CPU for ops it doesn't support,
+creating partition boundaries with CPU↔ANE round-trips between each.
+Three pre-load rewrites eliminate the fallbacks:
+
+### 2a. `Pad(mode=reflect)` → `Slice + Concat` (inswapper_128)
+
+Verified on this machine:
+
+| Config | Partitions | inswapper latency |
+|---|---|---|
+| Original ONNX, ORT 1.24.4 | **14** | 55.3 ms |
+| Rewritten, ORT 1.24.4 (this pass) | **1** | 27.4 ms |
+| Original ONNX, ORT 1.26 (main) | **1** | 27.2 ms |
+
+The third row uses ORT built from `main` at `fb13eb3edd` which contains
+[microsoft/onnxruntime#28073](https://github.com/microsoft/onnxruntime/pull/28073)
+— native MIL `pad(mode="reflect")` under `MLProgram`. Once ORT ≥ 1.26 is the
+floor, this pass can be deleted. See `REVIEW_TODOS.md` for the cleanup note.
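As a rough illustration of why the 2a rewrite preserves values, the sketch below reproduces reflect padding with nothing but slicing, reversal, and concatenation on plain Python lists. It is not the code in `modules/onnx_optimize.py`: `reflect_pad_1d` and `reflect_pad_2d` are hypothetical names, and the real pass presumably emits ONNX `Slice` and `Concat` nodes rather than list operations.

```python
def reflect_pad_1d(seq, before, after):
    """Reflect-pad a list using only slices and concatenation.

    Follows reflect-mode semantics: the edge element itself is not repeated.
    """
    n = len(seq)
    if before < 0 or after < 0 or before >= n or after >= n:
        raise ValueError("each pad must be in [0, len(seq) - 1]")
    head = seq[1:before + 1][::-1]          # elements 1..before, reversed
    tail = seq[n - 1 - after:n - 1][::-1]   # the `after` elements before the last, reversed
    return head + seq + tail


def reflect_pad_2d(plane, top, bottom, left, right):
    """Pad the width of every row, then pad the height by treating rows as elements."""
    rows = [reflect_pad_1d(row, left, right) for row in plane]
    return reflect_pad_1d(rows, top, bottom)


plane = [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]]
padded = reflect_pad_2d(plane, top=1, bottom=1, left=1, right=1)
```

Because the four pad amounts are separate arguments, the same construction also covers pads whose start and end sizes differ.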
+
+### 2b. `Shape → Gather` folded to constants (det_10g)
+
+Dynamic shape chains for FPN upsample target sizes forced parts of the
+face detector onto CPU. When the input shape is known at load time we
+run ONNX shape inference and replace the chains with `int64` constants.
+
+Measured: **21 ms → 4 ms** on the detection model.
+
+### 2c. `Split(axis=1, 2 outputs)` → `Slice` pairs (GFPGAN)
+
+CoreML EP doesn't support `Split`. GFPGAN's SFT modulation layers use
+channel-wise splits everywhere, forcing partition boundaries. Rewriting
+each 2-way Split as two Slices eliminates the fallbacks.
+
+Measured: **155 ms → 89 ms** on GFPGAN. This is the single largest
+contributor to the "GFPGAN enabled" row in the headline table.
+
+All three rewrites cache to disk with a `_coreml` suffix so the cost is
+paid once per model per machine.
+
+## 3. Pipeline overlap — `modules/processors/frame/core.py`, `face_swapper.py`
+
+Face detection and face swap both use the Neural Engine. Running them
+serially leaves the ANE idle during the detection half. The fork:
+
+- Overlaps detection N+1 with swap N via a thread pool. Adds one frame
+  of latency; doubles ANE utilization.
+- Skips `landmark_2d_106` when only `face_swapper` is active (landmarks
+  are unused unless mouth-mask or interpolation is on).
+- Parallelizes landmark + recognition post-detection when both are
+  needed.
+- Routes `det_10g` to GPU (Metal) so ANE stays free for the swap model.
+
+The one-frame detection lag is a known trade-off — acceptable for
+video frame rates where the face barely moves frame-to-frame. Flagged
+as a quality risk on fast motion / scene cuts in `REVIEW_TODOS.md`.
+
+## 4. GFPGAN-specific — `modules/processors/frame/face_enhancer.py`, `_onnx_enhancer.py`
+
+- Temporal cache: in live mode, run GFPGAN inference every Nth frame
+  and reuse the enhancement; interpolate the affine paste-back each
+  frame. Essentially free interpolation between inferences.
+- Pre-computed FFHQ 512 landmark template (avoids per-frame matrix solve).
+- Session created once under `create_onnx_session` with the same
+  `ModelFormat=MLProgram + MLComputeUnits=ALL` config as the swap
+  model — previously GFPGAN fell back to CPU-only.
+
+## 5. Paste-back optimization — `modules/processors/frame/face_swapper.py`
+
+`_fast_paste_back` replaces insightface's `paste_back` which operates
+on the full frame:
+
+- Computes face bbox from the affine matrix directly (no warp-and-scan
+  of a white mask).
+- Runs erosion, blur, and blend on the face bbox only.
+- GPU path (CUDA) keeps mask arithmetic on GPU end-to-end
+  (`torch.nn.functional.max_pool2d` / `avg_pool2d` for erode + blur).
+- Writes in-place into `target_img` to avoid a full-frame copy.
+
+## 6. Platform routing — `modules/platform_info.py`
+
+Single source of truth for OS / accelerator detection. Consumed by
+capture backend selection, provider config for `face_swapper` and
+`face_enhancer`, and a one-line startup banner confirming which code
+path the app took.
+
+## 7. Windows CUDA path (not exercised in M3 Max numbers)
+
+Not contributing to the Apple Silicon table but included in the same
+commit:
+
+- CUDA graphs via `enable_cuda_graph=1` + `io_binding` with
+  pre-allocated `OrtValue` buffers. Replays the recorded kernel launch
+  sequence each frame with near-zero CPU overhead. Requires static
+  input shape — inswapper is always `1×3×128×128 + 1×512`.
+- FP16 model auto-selected on GPUs with Tensor Cores (Turing+),
+  falls back to FP32 on older GPUs where FP16 can produce NaN.
+- DLL discovery fix for NVIDIA CUDA/cuDNN from pip-installed `torch`
+  and `nvidia-*` wheels.
+
+Enables 1080p @ 60 FPS on NVIDIA hardware (measured separately, not in
+the table above).
+
+## Reproducing the measurements
+
+Cold run (kill all deep-live-cam processes, no active CoreML cache):
+
+```bash
+cd /path/to/Deep-Live-Cam
+.venv/bin/python run.py
+# Look for:
+# [platform] ... -> confirms accelerator selection
+# [VideoCapturer] 960x540 @ 60fps (reported=...)
+# Partitions: 1 -> from CoreML EP verbose logs
+```
+
+For the CoreML partition / inswapper latency numbers specifically, see
+the standalone test at `/Users/max/Development/onnxruntime_test/test_pad_reflect.py`
+— runs the model with and without the graph rewrite on any installed ORT.
+
+## Future cleanup
+
+When ORT floor ≥ 1.26.0 lands (microsoft/onnxruntime#28073):
+
+- Delete `_decompose_reflect_pad` in `modules/onnx_optimize.py`.
+- Drop the `TODO(ort>=1.26)` markers.
+- Update `requirements.txt`.
+
+Does not change runtime performance — native MIL reflect matches the
+Slice+Concat rewrite to within noise (27.2 vs 27.4 ms in the table
+above). Purely a code-deletion cleanup.
diff --git a/REVIEW_TODOS.md b/REVIEW_TODOS.md
new file mode 100644
index 0000000..d4ee738
--- /dev/null
+++ b/REVIEW_TODOS.md
@@ -0,0 +1,147 @@
+# Review TODOs — Apple Silicon + Windows CUDA Perf Commit
+
+Post-merge review findings for commit `f65aeae` ("Apple Silicon + Windows CUDA
+perf: 60 FPS pipeline, cross-platform routing"). Findings come from two
+independent code reviews: Claude (in-tree read) and Codex (second opinion).
+Severity reflects production impact, not difficulty to fix.
+
+## Blockers
+
+### CUDA-graph replay buffers not locked — `modules/processors/frame/face_swapper.py:232-238`
+*Source: Claude + Codex (independent convergence)*
+
+`_cuda_graph_swap_inference` mutates module-level `ort_input` / `ort_latent`
+and runs `run_with_iobinding` with no lock. `multi_process_frame` runs frame
+work concurrently, so two swap calls can overwrite the same bound input
+buffers before `run_with_iobinding`, producing wrong-face output or
+corrupted frames. Compare the DML path at `face_swapper.py:382-386` which
+uses `modules.globals.dml_lock` for the same reason.
+
+**Fix:** a dedicated `_cuda_graph_lock` around the full
+update-run-get sequence inside `_cuda_graph_swap_inference`.
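A minimal sketch of the locking pattern this fix describes, assuming nothing about the real `face_swapper.py` beyond the finding above. The buffers, `_replay_graph`, and `run_concurrently` are hypothetical stand-ins: the real code binds `OrtValue` buffers and calls `run_with_iobinding`, neither of which is reproduced here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_graph_lock = threading.Lock()  # hypothetical stand-in for _cuda_graph_lock
_bound_input = [0]              # stand-in for the pre-allocated, bound input buffer
_bound_output = [0]             # stand-in for the bound output buffer


def _replay_graph():
    """Stand-in for the graph replay: reads the bound input, writes the bound output."""
    _bound_output[0] = _bound_input[0] * 2


def swap_inference(value):
    """Update, run, and read back as one critical section."""
    with _graph_lock:
        _bound_input[0] = value   # update the shared input buffer
        _replay_graph()           # run against whatever is currently bound
        return _bound_output[0]   # read the result before releasing the lock


def run_concurrently(values, workers=8):
    """Call swap_inference from several threads; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(swap_inference, values))


results = run_concurrently(list(range(64)))
```

The important detail is that the read-back sits inside the same `with` block as the buffer update. Locking only the run step would still let another thread overwrite the input or output in between.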
+
+## Should-fix
+
+### `many_faces` enhancer loop breaks after face #1 — `modules/processors/frame/face_enhancer.py:375`
+*Source: Codex*
+
+The `break` at line 375 is unconditional, so both the fresh-enhance and
+cache-reuse paths exit the face loop after the first face. In live
+`many_faces=True` mode, GFPGAN silently enhances only one face.
+
+**Fix:** gate the `break` on `not modules.globals.many_faces`, and disable
+the single-slot temporal cache in many-faces mode (cache would be
+overwritten per face, pasting the wrong enhancement).
+
+### `poisson_blend` operates on post-swap frame — `modules/processors/frame/face_swapper.py:437`
+*Source: Claude*
+
+`create_face_mask(target_face, temp_frame)` is called with `temp_frame`,
+but `_fast_paste_back` wrote in-place into `temp_frame` a few lines earlier
+(line 403). The mouth-mask path at line 414 correctly uses
+`original_frame` — Poisson should do the same.
+
+**Fix:** pass `original_frame` to `create_face_mask` on the Poisson path.
+
+### Shape/Gather fold crashes on vector indices — `modules/onnx_optimize.py:150-152`
+*Source: Codex*
+
+`int(inits[idx_name])` assumes the Gather index is scalar. Models that
+gather multiple dims at once pass a vector index — `int()` on a
+multi-element numpy array raises `TypeError`, aborting the whole
+optimization pass (no try/except around this section).
+
+**Fix:** check `inits[idx_name].ndim == 0` or `.size == 1` before folding;
+skip vector gathers (or fold to a vector constant initializer).
+
+### Reflect-pad decompose silently wrong for asymmetric pads — `modules/onnx_optimize.py:253`
+*Source: Codex*
+
+Only reads `pads[2]` and `pads[3]` (H-start, W-start); ignores `pads[6]`
+and `pads[7]` (H-end, W-end). Decomposition assumes start == end. Fine
+for inswapper_128 (symmetric `[0,0,3,3,0,0,3,3]`) but silently produces
+wrong output shape for any future asymmetric reflect pad.
+
+**Fix:** read all four pad values and generate top/bottom/left/right
+slice ranges separately. Or assert symmetry and skip otherwise.
+
+### `FACE_DETECTION_CACHE` data race — `modules/processors/frame/face_swapper.py:476-506`
+*Source: Claude*
+
+`get_faces_optimized` reads and writes `FACE_DETECTION_CACHE` /
+`LAST_DETECTION_TIME` module globals from multiple frame threads without
+any lock. Benign in practice (worst case: a duplicate detection or a
+stale read) but worth a lock wrap for hygiene.
+
+**Fix:** wrap read-modify-write in `THREAD_LOCK`.
+
+## Consider
+
+### Split decompose misses opset-13+ input form — `modules/onnx_optimize.py:346-357`
+*Source: Codex*
+
+Only reads the legacy `split` attribute. Opset 13+ passes split sizes as
+`input[1]`; those Split nodes stay on CPU. Depends on how GFPGAN was
+exported — verify against `gfpgan-1024.onnx` as actually shipped.
+
+**Fix:** additionally check `node.input[1]` against initializers for
+newer opsets.
+
+### `_preserve_emap_position` matches by shape, not name — `modules/onnx_optimize.py:408-423`
+*Source: Claude*
+
+Selects "first 512×512 initializer" as the emap. If insightface ever
+adds another 512×512 initializer before emap, we'd misplace the tensor.
+
+**Fix:** key on initializer name (insightface serializes it as `emap`
+in the proto).
+
+### One-frame detection lag + misleading comment — `modules/processors/frame/core.py:351-361`
+*Source: Codex*
+
+Pipelined detection result applied to the current frame is actually from
+the previous frame. The inline comment "Get the detection result for
+THIS frame" contradicts the later comment "the result will be used for
+the next iteration." Documented latency-for-throughput trade, but the
+first comment is wrong. Visible as a quality regression on fast motion
+/ scene cuts.
+
+**Fix:** correct the comment. Optionally add a config flag to disable
+pipelining for high-motion footage.
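The one-frame lag described above is easier to see in a toy model. The sketch below is not the code in `core.py`: `detect`, `process_stream`, and the `pipelined` flag are hypothetical, and frames are plain integers so that each detection simply reports which frame it came from.

```python
from concurrent.futures import ThreadPoolExecutor


def detect(frame_index):
    """Stand-in detector: returns the index of the frame it analysed."""
    return frame_index


def process_stream(frames, pipelined=True):
    """Return (frame, detection_used_for_that_frame) pairs."""
    if not pipelined:
        return [(frame, detect(frame)) for frame in frames]

    used = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # detection submitted on the previous iteration
        for frame in frames:
            previous = pending.result() if pending is not None else None
            pending = pool.submit(detect, frame)  # would overlap with the swap below
            # The very first frame has nothing to reuse, so it waits for its own result.
            detection = previous if previous is not None else pending.result()
            used.append((frame, detection))       # the swap for `frame` would run here
    return used


lagged = process_stream([0, 1, 2, 3])
exact = process_stream([0, 1, 2, 3], pipelined=False)
```

With pipelining on, every frame after the first is paired with the detection from the frame before it. The `pipelined=False` branch is one possible shape for the optional config flag suggested in the fix.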
+
+### Monkey-patching `swapper.session.run` is fragile — `modules/processors/frame/face_swapper.py:223`
+*Source: Claude*
+
+`swapper.session.run = _graph_run` replaces the method. If insightface
+rebuilds or swaps the session (e.g., on reconfigure), the patch is
+silently lost and we fall back to the standard path without warning.
+
+**Fix:** wrap the call site instead of monkey-patching the session,
+or assert the patch survives at key lifecycle points.
+
+### `_fast_paste_back` accepts unused `aimg` parameter — `modules/processors/frame/face_swapper.py:241`
+*Source: Claude*
+
+Caller at line 401-403 allocates a `_aimg_dummy` solely to satisfy the
+signature. Only `aimg.shape` is used.
+
+**Fix:** signature `(target_img, bgr_fake, face_h, face_w, M)`.
+
+### `onnxruntime.get_available_providers()` called at import time — `modules/platform_info.py:33`
+*Source: Claude*
+
+Runs before any Windows CUDA DLL path setup from `run.py` takes effect,
+unless `platform_info` is imported after that setup. Verify import
+order; otherwise CUDA provider may fail to register.
+
+**Fix:** lazy-evaluate on first use rather than at module load, or
+confirm `run.py` imports `platform_info` only after DLL-path shim.
+
+---
+
+## ORT 1.26 cleanup
+
+When ORT floor >= 1.26.0, delete `_decompose_reflect_pad` (pass 2) in
+`modules/onnx_optimize.py` — fixed upstream by
+[microsoft/onnxruntime#28073](https://github.com/microsoft/onnxruntime/pull/28073).
+See the `TODO(ort>=1.26)` markers in the file.
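Until the version floor moves, one way to make the reflect-pad pass conditional is a small version gate. This is only a sketch: `needs_reflect_pad_rewrite` is a hypothetical helper, the `(1, 26)` threshold is taken from the note above rather than verified against a release, and the version string would typically come from `onnxruntime.__version__`.

```python
import re


def needs_reflect_pad_rewrite(ort_version, fixed_in=(1, 26)):
    """Return True when the given ORT version predates native reflect-pad support."""
    match = re.match(r"(\d+)\.(\d+)", ort_version)
    if match is None:
        # Unknown version format: keep the rewrite, which is the conservative choice.
        return True
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < fixed_in


run_rewrite_on_1_24 = needs_reflect_pad_rewrite("1.24.4")
run_rewrite_on_1_26 = needs_reflect_pad_rewrite("1.26.0")
```

A source build that reports a lower version but already contains the upstream change would still run the rewrite. By the measurements in `PERFORMANCE.md`, that costs nothing beyond noise.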