mirror of
https://github.com/hacksider/Deep-Live-Cam.git
synced 2026-06-01 02:01:36 +02:00
Add PERFORMANCE.md and REVIEW_TODOS.md
PERFORMANCE.md documents measured gains on MacBook Pro M3 Max vs
hacksider/Deep-Live-Cam main@64d3f06:

- Face swap only: <5 FPS -> >20 FPS
- Face swap + GFPGAN: <2 FPS -> >10 FPS
- Camera: 640x480 -> 960x540 MJPEG @ 60fps

Breaks down the contributors (camera negotiation, CoreML graph rewrites
with before/after op latencies, pipeline overlap, GFPGAN temporal cache,
paste-back optimization, platform routing, Windows CUDA path) and how to
reproduce.

REVIEW_TODOS.md captures 12 findings from two independent reviews
(Claude in-tree + Codex second opinion) grouped as Blockers / Should-fix /
Consider, each with file:line and suggested fix. The two
Blocker/Should-fix items are addressed in the preceding commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+171
@@ -0,0 +1,171 @@
# Apple Silicon + Cross-Platform Performance

End-to-end measurements from commit `f65aeae` on a MacBook Pro M3 Max
against `hacksider/Deep-Live-Cam` upstream `main@64d3f06`, same hardware,
same camera, same source/target faces.

| Mode | Upstream `main` | This fork | Delta |
|---|---|---|---|
| Face swap only | **<5 FPS** | **>20 FPS** | ~5× |
| Face swap + GFPGAN enhancer | **<2 FPS** | **>10 FPS** | ~5× |
| Camera resolution | 640×480 default | **960×540 MJPEG** | wider FoV |
| Camera frame rate | 15–30 fps (backend default) | **60 fps negotiated + measured** | up to 2–4× |

The gap is cumulative: no single change accounts for it. Each section
below describes one contributor, in rough order of impact on the
live-video pipeline.

## 1. Camera capture negotiation: `modules/video_capture.py`

Upstream calls `cv2.VideoCapture(device_index)` with no hints and
accepts whatever the camera defaults to. On most webcams that means
640×480 YUV at 15–30 fps, and the value reported by `CAP_PROP_FPS` is
unreliable on DirectShow.

This fork:

- Requests `MJPG` fourcc to bypass USB bandwidth limits on uncompressed
  YUV at high resolutions.
- Requests 960×540 @ 60 fps up front.
- Reads back the camera's actual resolution via `CAP_PROP_FRAME_WIDTH/HEIGHT`.
- **Empirically measures FPS** by timing 30 frames after 10 warmup
  reads (`_measure_fps`) instead of trusting `CAP_PROP_FPS`. Costs
  ~0.5–1 s at startup; gives ground-truth numbers that downstream
  code (detection cadence, enhancer throttle) tunes against.
- Windows path tries `CAP_MSMF` before `CAP_DSHOW` (DirectShow often
  caps at 30 fps even when the camera supports 60).

This single change is why the resolution / FoV / FPS look different between
upstream and the fork before any ML work starts.
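
The warmup-then-time measurement described above can be sketched as follows. This is a minimal illustration, not the fork's `_measure_fps`: the `read_frame` callable, the injectable `clock`, and the `FakeCamera` class are assumptions introduced here so the logic runs without a physical camera.

```python
import time
from typing import Callable


def measure_fps(
    read_frame: Callable[[], bool],
    warmup: int = 10,
    samples: int = 30,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Estimate capture FPS by timing `samples` reads after `warmup` reads.

    `read_frame` is a hypothetical callable that blocks until the next frame
    and returns True on success. Returns 0.0 if frames stop arriving.
    """
    for _ in range(warmup):  # discard frames while the device settles
        if not read_frame():
            return 0.0
    start = clock()
    for _ in range(samples):
        if not read_frame():
            return 0.0
    elapsed = clock() - start
    return samples / elapsed if elapsed > 0 else 0.0


class FakeCamera:
    """Stand-in for a camera: each read advances a simulated clock."""

    def __init__(self, fps: float) -> None:
        self.now = 0.0
        self.period = 1.0 / fps

    def read(self) -> bool:
        self.now += self.period  # a read "blocks" for one frame period
        return True

    def clock(self) -> float:
        return self.now


cam = FakeCamera(fps=60.0)
measured = measure_fps(cam.read, clock=cam.clock)
print(f"measured {measured:.1f} fps")
```

With OpenCV, `read_frame` could be something like `lambda: cap.read()[0]` on a capture that has already had `cv2.CAP_PROP_FOURCC` set to `cv2.VideoWriter_fourcc(*"MJPG")`; the default `time.perf_counter` clock then measures real elapsed time.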

## 2. CoreML graph rewrites: `modules/onnx_optimize.py`

CoreML EP silently falls back to CPU for ops it doesn't support,
creating partition boundaries with a CPU↔ANE round-trip at each one.
Three pre-load rewrites eliminate the fallbacks:

### 2a. `Pad(mode=reflect)` → `Slice + Concat` (inswapper_128)

Verified on this machine:

| Config | Partitions | inswapper latency |
|---|---|---|
| Original ONNX, ORT 1.24.4 | **14** | 55.3 ms |
| Rewritten, ORT 1.24.4 (this pass) | **1** | 27.4 ms |
| Original ONNX, ORT 1.26 (main) | **1** | 27.2 ms |

The third row uses ORT built from `main` at `fb13eb3edd`, which contains
[microsoft/onnxruntime#28073](https://github.com/microsoft/onnxruntime/pull/28073)
(native MIL `pad(mode="reflect")` under `MLProgram`). Once ORT ≥ 1.26 is the
floor, this pass can be deleted. See `REVIEW_TODOS.md` for the cleanup note.
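
To see why a reflect pad can be expressed with slices and concatenation alone, here is a minimal pure-Python sketch on nested lists. It illustrates the index arithmetic only and is not the graph rewrite in `modules/onnx_optimize.py`; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def reflect_pad_1d(x: Sequence[T], begin: int, end: int) -> List[T]:
    """Reflect-pad a sequence using only slices, reversal, and concatenation.

    The edge element is not repeated, which is the "reflect" convention.
    Each pad amount must be smaller than the sequence length.
    """
    n = len(x)
    if not (0 <= begin < n and 0 <= end < n):
        raise ValueError("pad amounts must be in [0, len(x))")
    left = list(x[1:begin + 1])[::-1]          # x[begin], ..., x[1]
    right = list(x[n - 1 - end:n - 1])[::-1]   # x[n-2], ..., x[n-1-end]
    return left + list(x) + right


def reflect_pad_2d(
    img: Sequence[Sequence[T]], pad_h: Tuple[int, int], pad_w: Tuple[int, int]
) -> List[List[T]]:
    """Pad the width of every row, then pad the height by reflecting whole rows."""
    rows = [reflect_pad_1d(row, *pad_w) for row in img]
    return reflect_pad_1d(rows, *pad_h)


padded_row = reflect_pad_1d([1, 2, 3, 4, 5], 2, 2)
padded_img = reflect_pad_2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], (1, 1), (1, 1))
print(padded_row)
print(padded_img)
```

Begin and end amounts are separate arguments here, so the same arithmetic also covers asymmetric pads.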

### 2b. `Shape → Gather` folded to constants (det_10g)

Dynamic shape chains for FPN upsample target sizes forced parts of the
face detector onto CPU. When the input shape is known at load time we
run ONNX shape inference and replace the chains with `int64` constants.

Measured: **21 ms → 4 ms** on the detection model.

### 2c. `Split(axis=1, 2 outputs)` → `Slice` pairs (GFPGAN)

CoreML EP doesn't support `Split`. GFPGAN's SFT modulation layers use
channel-wise splits everywhere, forcing partition boundaries. Rewriting
each 2-way Split as two Slices eliminates the fallbacks.

Measured: **155 ms → 89 ms** on GFPGAN. This is the single largest
contributor to the "GFPGAN enabled" row in the headline table.
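
The Split-to-Slice equivalence comes down to turning split sizes into `[start, end)` ranges along the channel axis. The sketch below shows that bookkeeping on a nested-list tensor; the helper names are hypothetical and the real pass operates on ONNX nodes rather than Python lists.

```python
from typing import List, Sequence, Tuple


def split_to_slice_ranges(sizes: Sequence[int]) -> List[Tuple[int, int]]:
    """Convert Split sizes along one axis into half-open Slice ranges."""
    ranges = []
    start = 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges


def slice_channels(tensor: Sequence[Sequence[object]], start: int, end: int) -> list:
    """Slice axis 1 (channels) of a tensor stored as [batch][channel]."""
    return [list(sample[start:end]) for sample in tensor]


# One sample with four channels, split 2 + 2 as in a two-output Split on axis 1.
tensor = [["c0", "c1", "c2", "c3"]]
ranges = split_to_slice_ranges([2, 2])
halves = [slice_channels(tensor, s, e) for s, e in ranges]
print(ranges, halves)
```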

All three rewrites cache to disk with a `_coreml` suffix so the cost is
paid once per model per machine.

## 3. Pipeline overlap: `modules/processors/frame/core.py`, `face_swapper.py`

Face detection and face swap both use the Neural Engine. Running them
serially leaves the ANE idle during the detection half. The fork:

- Overlaps detection N+1 with swap N via a thread pool. Adds one frame
  of latency; doubles ANE utilization.
- Skips `landmark_2d_106` when only `face_swapper` is active (landmarks
  are unused unless mouth-mask or interpolation is on).
- Parallelizes landmark + recognition post-detection when both are
  needed.
- Routes `det_10g` to GPU (Metal) so ANE stays free for the swap model.

The one-frame detection lag is a known trade-off, acceptable at
video frame rates where the face barely moves frame-to-frame. Flagged
as a quality risk on fast motion / scene cuts in `REVIEW_TODOS.md`.
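
The overlap and its one-frame lag can be sketched with a single worker thread. This is an illustrative model under stated assumptions, not the code in `core.py`: `pipelined_swap`, `detect`, and `swap` are hypothetical names, and the sketch follows the behavior described in `REVIEW_TODOS.md`, where the swap of a frame uses the faces detected on the previous frame.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, Optional, TypeVar

Frame = TypeVar("Frame")
Faces = TypeVar("Faces")
Out = TypeVar("Out")


def pipelined_swap(
    frames: Iterable[Frame],
    detect: Callable[[Frame], Faces],
    swap: Callable[[Frame, Faces], Out],
) -> Iterator[Out]:
    """Run detection for the current frame in the background while swapping.

    From the second frame on, swap() receives the detection result of the
    previous frame, so detection and swap execute concurrently.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        previous: Optional[Future] = None
        for frame in frames:
            current = pool.submit(detect, frame)  # overlaps with swap() below
            # First frame has no earlier result, so it waits for its own detection.
            source = current if previous is None else previous
            faces = source.result()
            previous = current
            yield swap(frame, faces)


outputs = list(
    pipelined_swap(
        frames=[0, 1, 2, 3],
        detect=lambda f: f"faces@{f}",
        swap=lambda f, faces: (f, faces),
    )
)
print(outputs)
```

A flag that bypasses the worker and calls `detect` synchronously would restore per-frame accuracy at the cost of throughput, which is the option the review notes suggest for high-motion footage.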

## 4. GFPGAN-specific: `modules/processors/frame/face_enhancer.py`, `_onnx_enhancer.py`

- Temporal cache: in live mode, run GFPGAN inference every Nth frame
  and reuse the enhancement; interpolate the affine paste-back each
  frame. Essentially free interpolation between inferences.
- Pre-computed FFHQ 512 landmark template (avoids per-frame matrix solve).
- Session created once under `create_onnx_session` with the same
  `ModelFormat=MLProgram + MLComputeUnits=ALL` config as the swap
  model; previously GFPGAN fell back to CPU-only.
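
The temporal cache in the first bullet can be modeled as a small wrapper that only calls the expensive model every Nth frame. This is a sketch under assumptions: the class name, the `every_n` parameter, and the `enhance` callable are hypothetical, and the per-frame affine paste-back of the cached result is left to the caller.

```python
from typing import Callable, Generic, Optional, TypeVar

Crop = TypeVar("Crop")
Enhanced = TypeVar("Enhanced")


class TemporalEnhanceCache(Generic[Crop, Enhanced]):
    """Run `enhance` on every Nth frame and reuse the last result in between."""

    def __init__(self, enhance: Callable[[Crop], Enhanced], every_n: int = 3) -> None:
        if every_n < 1:
            raise ValueError("every_n must be >= 1")
        self._enhance = enhance
        self._every_n = every_n
        self._frame_index = 0
        self._cached: Optional[Enhanced] = None

    def __call__(self, crop: Crop) -> Enhanced:
        if self._frame_index % self._every_n == 0:
            self._cached = self._enhance(crop)  # expensive inference
        self._frame_index += 1
        return self._cached  # caller pastes this back with the current affine


calls = []


def fake_enhance(crop: int) -> str:
    calls.append(crop)
    return f"enhanced@{crop}"


cache = TemporalEnhanceCache(fake_enhance, every_n=3)
results = [cache(frame) for frame in range(7)]
print(results, calls)
```

The cache holds a single slot, which is why the review notes recommend disabling it when several faces are enhanced per frame.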

## 5. Paste-back optimization: `modules/processors/frame/face_swapper.py`

`_fast_paste_back` replaces insightface's `paste_back`, which operates
on the full frame:

- Computes face bbox from the affine matrix directly (no warp-and-scan
  of a white mask).
- Runs erosion, blur, and blend on the face bbox only.
- GPU path (CUDA) keeps mask arithmetic on GPU end-to-end
  (`torch.nn.functional.max_pool2d` / `avg_pool2d` for erode + blur).
- Writes in-place into `target_img` to avoid a full-frame copy.
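
Deriving the face bounding box from the affine matrix means inverting the 2×3 transform and mapping the four corners of the aligned crop back into the frame. The sketch below assumes `M` maps frame coordinates to crop coordinates; the function names are hypothetical and it is not the fork's `_fast_paste_back`.

```python
import math
from typing import Sequence, Tuple

Affine = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def invert_affine(M: Sequence[Sequence[float]]) -> Affine:
    """Invert a 2x3 affine matrix [[a, b, tx], [c, d, ty]]."""
    (a, b, tx), (c, d, ty) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine matrix is not invertible")
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    return ((ia, ib, -(ia * tx + ib * ty)), (ic, id_, -(ic * tx + id_ * ty)))


def face_bbox_from_affine(
    M: Sequence[Sequence[float]], crop_w: int, crop_h: int, frame_w: int, frame_h: int
) -> Tuple[int, int, int, int]:
    """Return (x0, y0, x1, y1) of the crop's footprint in the frame, clamped."""
    inv = invert_affine(M)
    corners = [(0, 0), (crop_w, 0), (crop_w, crop_h), (0, crop_h)]
    xs = [inv[0][0] * x + inv[0][1] * y + inv[0][2] for x, y in corners]
    ys = [inv[1][0] * x + inv[1][1] * y + inv[1][2] for x, y in corners]
    x0 = max(0, math.floor(min(xs)))
    y0 = max(0, math.floor(min(ys)))
    x1 = min(frame_w, math.ceil(max(xs)))
    y1 = min(frame_h, math.ceil(max(ys)))
    return x0, y0, x1, y1


# Frame -> crop: scale by 0.5, then shift by (-50, -20).
M = ((0.5, 0.0, -50.0), (0.0, 0.5, -20.0))
bbox = face_bbox_from_affine(M, crop_w=128, crop_h=128, frame_w=640, frame_h=480)
print(bbox)
```

Erosion, blur, and blending can then run on `frame[y0:y1, x0:x1]` instead of the whole image.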

## 6. Platform routing: `modules/platform_info.py`

Single source of truth for OS / accelerator detection. Consumed by
capture backend selection, provider config for `face_swapper` and
`face_enhancer`, and a one-line startup banner confirming which code
path the app took.

## 7. Windows CUDA path (not exercised in M3 Max numbers)

Not contributing to the Apple Silicon table but included in the same
commit:

- CUDA graphs via `enable_cuda_graph=1` + `io_binding` with
  pre-allocated `OrtValue` buffers. Replays the recorded kernel launch
  sequence each frame with near-zero CPU overhead. Requires a static
  input shape; inswapper is always `1×3×128×128 + 1×512`.
- FP16 model auto-selected on GPUs with Tensor Cores (Turing+),
  falls back to FP32 on older GPUs where FP16 can produce NaN.
- DLL discovery fix for NVIDIA CUDA/cuDNN from pip-installed `torch`
  and `nvidia-*` wheels.

Enables 1080p @ 60 FPS on NVIDIA hardware (measured separately, not in
the table above).

## Reproducing the measurements

Cold run (kill all deep-live-cam processes, no active CoreML cache):

```bash
cd /path/to/Deep-Live-Cam
.venv/bin/python run.py
# Look for:
# [platform] ... -> confirms accelerator selection
# [VideoCapturer] 960x540 @ 60fps (reported=...)
# Partitions: 1 -> from CoreML EP verbose logs
```

For the CoreML partition / inswapper latency numbers specifically, see
the standalone test at `/Users/max/Development/onnxruntime_test/test_pad_reflect.py`,
which runs the model with and without the graph rewrite on any installed ORT.

## Future cleanup

When ORT floor ≥ 1.26.0 lands (microsoft/onnxruntime#28073):

- Delete `_decompose_reflect_pad` in `modules/onnx_optimize.py`.
- Drop the `TODO(ort>=1.26)` markers.
- Update `requirements.txt`.

This does not change runtime performance: native MIL reflect matches the
Slice+Concat rewrite to within noise (27.2 vs 27.4 ms in the section 2a
table). Purely a code-deletion cleanup.
+147
@@ -0,0 +1,147 @@
# Review TODOs: Apple Silicon + Windows CUDA Perf Commit

Post-merge review findings for commit `f65aeae` ("Apple Silicon + Windows CUDA
perf: 60 FPS pipeline, cross-platform routing"). Findings come from two
independent code reviews: Claude (in-tree read) and Codex (second opinion).
Severity reflects production impact, not difficulty to fix.

## Blockers

### CUDA-graph replay buffers not locked: `modules/processors/frame/face_swapper.py:232-238`
*Source: Claude + Codex (independent convergence)*

`_cuda_graph_swap_inference` mutates module-level `ort_input` / `ort_latent`
and runs `run_with_iobinding` with no lock. `multi_process_frame` runs frame
work concurrently, so two swap calls can overwrite the same bound input
buffers before `run_with_iobinding`, producing wrong-face output or
corrupted frames. Compare the DML path at `face_swapper.py:382-386`, which
uses `modules.globals.dml_lock` for the same reason.

**Fix:** a dedicated `_cuda_graph_lock` around the full
update-run-get sequence inside `_cuda_graph_swap_inference`.
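
The shape of that fix can be sketched with a stand-in session. `FakeGraphSession` and `graph_swap_inference` are hypothetical names used only for illustration; the real function binds `OrtValue` buffers and calls `run_with_iobinding`.

```python
import threading
from typing import Any, Tuple


class FakeGraphSession:
    """Stand-in for a session whose input buffers are bound once and reused."""

    def __init__(self) -> None:
        self.input_buf: Any = None
        self.latent_buf: Any = None

    def run(self) -> Tuple[Any, Any]:
        # A real session would replay the captured graph on the bound buffers.
        return (self.input_buf, self.latent_buf)


_cuda_graph_lock = threading.Lock()


def graph_swap_inference(session: FakeGraphSession, face: Any, latent: Any) -> Tuple[Any, Any]:
    """Update the shared buffers, run, and read the output as one critical section."""
    with _cuda_graph_lock:
        session.input_buf = face
        session.latent_buf = latent
        return session.run()  # output is read before the lock is released


single = graph_swap_inference(FakeGraphSession(), "face-A", "latent-A")
print(single)
```

Holding the lock across all three steps matters: locking only the buffer update would still let a second thread overwrite the inputs before the first thread's run reads them.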

## Should-fix

### `many_faces` enhancer loop breaks after face #1: `modules/processors/frame/face_enhancer.py:375`
*Source: Codex*

The `break` at line 375 is unconditional, so both the fresh-enhance and
cache-reuse paths exit the face loop after the first face. In live
`many_faces=True` mode, GFPGAN silently enhances only one face.

**Fix:** gate the `break` on `not modules.globals.many_faces`, and disable
the single-slot temporal cache in many-faces mode (cache would be
overwritten per face, pasting the wrong enhancement).

### `poisson_blend` operates on post-swap frame: `modules/processors/frame/face_swapper.py:437`
*Source: Claude*

`create_face_mask(target_face, temp_frame)` is called with `temp_frame`,
but `_fast_paste_back` wrote in-place into `temp_frame` a few lines earlier
(line 403). The mouth-mask path at line 414 correctly uses
`original_frame`; Poisson should do the same.

**Fix:** pass `original_frame` to `create_face_mask` on the Poisson path.

### Shape/Gather fold crashes on vector indices: `modules/onnx_optimize.py:150-152`
*Source: Codex*

`int(inits[idx_name])` assumes the Gather index is scalar. Models that
gather multiple dims at once pass a vector index, and `int()` on a
multi-element numpy array raises `TypeError`, aborting the whole
optimization pass (no try/except around this section).

**Fix:** check `inits[idx_name].ndim == 0` or `.size == 1` before folding;
skip vector gathers (or fold to a vector constant initializer).
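
A fold that tolerates both scalar and vector indices might look like the sketch below. It works on plain Python values rather than ONNX initializers, and `fold_shape_gather` is a hypothetical helper, so treat it as an outline of the decision logic only.

```python
from typing import List, Optional, Sequence, Union

Index = Union[int, Sequence[int]]
Folded = Union[int, List[int], None]


def fold_shape_gather(shape: Sequence[Optional[int]], index: Index) -> Folded:
    """Constant-fold Gather(Shape(x), index) given a statically inferred shape.

    `shape` uses None for dynamic dimensions. Returns an int for a scalar
    index, a list for a vector index, or None when folding is not safe.
    """

    def pick(i: int) -> Optional[int]:
        if not -len(shape) <= i < len(shape):
            return None  # out-of-range index: leave the node alone
        return shape[i]

    if isinstance(index, int):
        return pick(index)
    values = [pick(i) for i in index]
    if any(v is None for v in values):
        return None  # at least one gathered dim is dynamic or invalid
    return values


static_shape = [1, 3, 640, 640]
scalar_fold = fold_shape_gather(static_shape, 2)
vector_fold = fold_shape_gather(static_shape, [2, 3])
dynamic_fold = fold_shape_gather([1, 3, None, 640], 2)
print(scalar_fold, vector_fold, dynamic_fold)
```

Returning `None` instead of raising lets the rest of the optimization pass continue when a single chain cannot be folded.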

### Reflect-pad decompose silently wrong for asymmetric pads: `modules/onnx_optimize.py:253`
*Source: Codex*

Only reads `pads[2]` and `pads[3]` (H-start, W-start); ignores `pads[6]`
and `pads[7]` (H-end, W-end). Decomposition assumes start == end. Fine
for inswapper_128 (symmetric `[0,0,3,3,0,0,3,3]`) but silently produces
wrong output shape for any future asymmetric reflect pad.

**Fix:** read all four pad values and generate top/bottom/left/right
slice ranges separately. Or assert symmetry and skip otherwise.
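
Reading all four values is mostly a matter of respecting the ONNX `pads` layout, which lists every axis's begin value first and every axis's end value second. The helpers below are a hypothetical sketch of that parsing for a 4-D NCHW tensor.

```python
from typing import List, Sequence, Tuple


def split_onnx_pads(pads: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn [x1_begin, x2_begin, ..., x1_end, x2_end, ...] into (begin, end) per axis."""
    if len(pads) % 2 != 0:
        raise ValueError("pads must hold a begin and an end value per axis")
    rank = len(pads) // 2
    return [(pads[i], pads[i + rank]) for i in range(rank)]


def hw_pads(pads: Sequence[int]) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((top, bottom), (left, right)) for an NCHW Pad, or raise if unsupported."""
    per_axis = split_onnx_pads(pads)
    if len(per_axis) != 4:
        raise ValueError("expected pads for a 4-D NCHW tensor")
    batch, channels, height, width = per_axis
    if any(batch) or any(channels):
        raise ValueError("padding on N or C is not handled by this sketch")
    return height, width


symmetric = hw_pads([0, 0, 3, 3, 0, 0, 3, 3])   # the inswapper_128 case
asymmetric = hw_pads([0, 0, 1, 2, 0, 0, 3, 4])
print(symmetric, asymmetric)
```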

### `FACE_DETECTION_CACHE` data race: `modules/processors/frame/face_swapper.py:476-506`
*Source: Claude*

`get_faces_optimized` reads and writes `FACE_DETECTION_CACHE` /
`LAST_DETECTION_TIME` module globals from multiple frame threads without
any lock. Benign in practice (worst case: a duplicate detection or a
stale read) but worth a lock wrap for hygiene.

**Fix:** wrap read-modify-write in `THREAD_LOCK`.

## Consider

### Split decompose misses opset-13+ input form: `modules/onnx_optimize.py:346-357`
*Source: Codex*

Only reads the legacy `split` attribute. Opset 13+ passes split sizes as
`input[1]`; those Split nodes stay on CPU. Depends on how GFPGAN was
exported, so verify against `gfpgan-1024.onnx` as actually shipped.

**Fix:** additionally check `node.input[1]` against initializers for
newer opsets.

### `_preserve_emap_position` matches by shape, not name: `modules/onnx_optimize.py:408-423`
*Source: Claude*

Selects "first 512×512 initializer" as the emap. If insightface ever
adds another 512×512 initializer before emap, we'd misplace the tensor.

**Fix:** key on initializer name (insightface serializes it as `emap`
in the proto).

### One-frame detection lag + misleading comment: `modules/processors/frame/core.py:351-361`
*Source: Codex*

Pipelined detection result applied to the current frame is actually from
the previous frame. The inline comment "Get the detection result for
THIS frame" contradicts the later comment "the result will be used for
the next iteration." Documented latency-for-throughput trade, but the
first comment is wrong. Visible as a quality regression on fast motion
/ scene cuts.

**Fix:** correct the comment. Optionally add a config flag to disable
pipelining for high-motion footage.

### Monkey-patching `swapper.session.run` is fragile: `modules/processors/frame/face_swapper.py:223`
*Source: Claude*

`swapper.session.run = _graph_run` replaces the method. If insightface
rebuilds or swaps the session (e.g., on reconfigure), the patch is
silently lost and we fall back to the standard path without warning.

**Fix:** wrap the call site instead of monkey-patching the session,
or assert the patch survives at key lifecycle points.

### `_fast_paste_back` accepts unused `aimg` parameter: `modules/processors/frame/face_swapper.py:241`
*Source: Claude*

The caller at lines 401-403 allocates `_aimg_dummy` solely to satisfy the
signature. Only `aimg.shape` is used.

**Fix:** signature `(target_img, bgr_fake, face_h, face_w, M)`.

### `onnxruntime.get_available_providers()` called at import time: `modules/platform_info.py:33`
*Source: Claude*

Runs before any Windows CUDA DLL path setup from `run.py` takes effect,
unless `platform_info` is imported after that setup. Verify import
order; otherwise CUDA provider may fail to register.

**Fix:** lazy-evaluate on first use rather than at module load, or
confirm `run.py` imports `platform_info` only after DLL-path shim.
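
The lazy-evaluation option can be sketched with `functools.lru_cache`, which defers the probe to the first call and caches the result afterwards. The factory name and the injected `probe` callable are assumptions made so the sketch runs without ONNX Runtime installed; in the real module the probe would be `onnxruntime.get_available_providers`.

```python
import functools
from typing import Callable, Sequence, Tuple


def make_lazy_provider_lookup(
    probe: Callable[[], Sequence[str]],
) -> Callable[[], Tuple[str, ...]]:
    """Return a zero-argument function that runs `probe` once, on first use."""

    @functools.lru_cache(maxsize=None)
    def available_providers() -> Tuple[str, ...]:
        return tuple(probe())

    return available_providers


probe_calls = []


def fake_probe() -> Sequence[str]:
    # Stand-in for the real provider query; records when it is invoked.
    probe_calls.append("probed")
    return ["CUDAExecutionProvider", "CPUExecutionProvider"]


available_providers = make_lazy_provider_lookup(fake_probe)
calls_at_import = len(probe_calls)   # nothing has been probed yet
first = available_providers()        # DLL-path setup would have run by now
second = available_providers()       # served from the cache
print(calls_at_import, first, len(probe_calls))
```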

---

## ORT 1.26 cleanup

When ORT floor >= 1.26.0, delete `_decompose_reflect_pad` (pass 2) in
`modules/onnx_optimize.py`, which is fixed upstream by
[microsoft/onnxruntime#28073](https://github.com/microsoft/onnxruntime/pull/28073).
See the `TODO(ort>=1.26)` markers in the file.