Add PERFORMANCE.md and REVIEW_TODOS.md

PERFORMANCE.md documents measured gains on MacBook Pro M3 Max vs
hacksider/Deep-Live-Cam main@64d3f06:

- Face swap only:     <5 FPS  ->  >20 FPS
- Face swap + GFPGAN: <2 FPS  ->  >10 FPS
- Camera:             640x480 ->  960x540 MJPEG @ 60fps

Breaks down the contributors (camera negotiation, CoreML graph
rewrites with before/after op latencies, pipeline overlap, GFPGAN
temporal cache, paste-back optimization, platform routing, Windows
CUDA path) and how to reproduce.

REVIEW_TODOS.md captures 12 findings from two independent reviews
(Claude in-tree + Codex second opinion) grouped as Blockers /
Should-fix / Consider, each with file:line and suggested fix. The
two Blocker/Should-fix items are addressed in the preceding commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Max Buckley
2026-04-22 11:08:33 +02:00
parent 4d04e830bc
commit ea19030c74
2 changed files with 318 additions and 0 deletions
+171
@@ -0,0 +1,171 @@
# Apple Silicon + Cross-Platform Performance
End-to-end measurements from commit `f65aeae` on a MacBook Pro M3 Max
against `hacksider/Deep-Live-Cam` upstream `main@64d3f06`, same hardware,
same camera, same source/target faces.
| Mode | Upstream `main` | This fork | Delta |
|---|---|---|---|
| Face swap only | **<5 FPS** | **>20 FPS** | ~5× |
| Face swap + GFPGAN enhancer | **<2 FPS** | **>10 FPS** | ~5× |
| Camera resolution | 640×480 default | **960×540 MJPEG** | wider FoV |
| Camera frame rate | 15–30 fps (backend default) | **60 fps negotiated + measured** | up to 2–4× |
The gap is cumulative — no single change accounts for it. Each section
below describes one contributor, in rough order of impact on the
live-video pipeline.
## 1. Camera capture negotiation — `modules/video_capture.py`
Upstream calls `cv2.VideoCapture(device_index)` with no hints and
accepts whatever the camera defaults to. On most webcams that means
640×480 YUV at 15–30 fps, and `CAP_PROP_FPS` lies on DirectShow.
This fork:
- Requests `MJPG` fourcc to bypass USB bandwidth limits on uncompressed
YUV at high resolutions.
- Requests 960×540 @ 60 fps up front.
- Reads back the camera's actual resolution via `CAP_PROP_FRAME_WIDTH/HEIGHT`.
- **Empirically measures FPS** by timing 30 frames after 10 warmup
reads (`_measure_fps`) instead of trusting `CAP_PROP_FPS`. Costs
~0.5–1 s at startup; gives ground-truth numbers that downstream
code (detection cadence, enhancer throttle) tunes against.
- Windows path tries `CAP_MSMF` before `CAP_DSHOW` (DirectShow often
caps at 30 fps even when the camera supports 60).
This single change is why the resolution / FoV / FPS look different between
upstream and the fork before any ML work starts.
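The empirical FPS measurement described above can be sketched roughly as follows. This is an illustration only, not the fork's `_measure_fps`: the `read_frame` callable is a hypothetical stand-in for `cv2.VideoCapture.read`, and the defaults (10 warmup reads, 30 timed reads) are taken from the description rather than from the source.

```python
import time


def measure_fps(read_frame, warmup=10, samples=30):
    """Estimate capture FPS by timing `samples` reads after `warmup` reads.

    `read_frame` is a hypothetical stand-in for `cap.read`; it must return
    an (ok, frame) pair, as cv2.VideoCapture.read does.
    """
    for _ in range(warmup):
        read_frame()  # discard frames while exposure and buffering settle

    start = time.perf_counter()
    good = 0
    for _ in range(samples):
        ok, _frame = read_frame()
        if ok:
            good += 1
    elapsed = time.perf_counter() - start

    # Report 0.0 if no frame was delivered or the clock did not advance.
    return good / elapsed if good and elapsed > 0 else 0.0
```

With a real camera this would be called as `measure_fps(cap.read)` after the fourcc, resolution, and FPS requests have been made, so the result reflects what the device actually delivers.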
## 2. CoreML graph rewrites — `modules/onnx_optimize.py`
CoreML EP silently falls back to CPU for ops it doesn't support,
creating partition boundaries with CPU↔ANE round-trips between each.
Three pre-load rewrites eliminate the fallbacks:
### 2a. `Pad(mode=reflect)` → `Slice + Concat` (inswapper_128)
Verified on this machine:
| Config | Partitions | inswapper latency |
|---|---|---|
| Original ONNX, ORT 1.24.4 | **14** | 55.3 ms |
| Rewritten, ORT 1.24.4 (this pass) | **1** | 27.4 ms |
| Original ONNX, ORT 1.26 (main) | **1** | 27.2 ms |
The third row uses ORT built from `main` at `fb13eb3edd` which contains
[microsoft/onnxruntime#28073](https://github.com/microsoft/onnxruntime/pull/28073)
— native MIL `pad(mode="reflect")` under `MLProgram`. Once ORT ≥ 1.26 is the
floor, this pass can be deleted. See `REVIEW_TODOS.md` for the cleanup note.
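To show why the rewrite is value-preserving, here is the same idea on a plain Python list along a single axis. It is a sketch of the arithmetic only: the real pass rewrites ONNX graph nodes, and `reflect_pad_1d` is a hypothetical name, not a function in `modules/onnx_optimize.py`.

```python
def reflect_pad_1d(row, before, after):
    """Reflect-pad a list using only slicing and concatenation.

    Mirrors around the edge element without repeating it, which is the
    "reflect" mode of ONNX Pad. Each pad must be smaller than len(row).
    """
    n = len(row)
    if before >= n or after >= n:
        raise ValueError("reflect pad must be smaller than the axis length")
    left = row[1:before + 1][::-1]           # Slice next to the start, reversed
    right = row[n - 1 - after:n - 1][::-1]   # Slice next to the end, reversed
    return left + row + right                # Concat


print(reflect_pad_1d([1, 2, 3, 4], 2, 2))  # [3, 2, 1, 2, 3, 4, 3, 2]
```

For a 2-D image the same step is applied along H and then along W.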
### 2b. `Shape → Gather` folded to constants (det_10g)
Dynamic shape chains for FPN upsample target sizes forced parts of the
face detector onto CPU. When the input shape is known at load time we
run ONNX shape inference and replace the chains with `int64` constants.
Measured: **21 ms → 4 ms** on the detection model.
### 2c. `Split(axis=1, 2 outputs)` → `Slice` pairs (GFPGAN)
CoreML EP doesn't support `Split`. GFPGAN's SFT modulation layers use
channel-wise splits everywhere, forcing partition boundaries. Rewriting
each 2-way Split as two Slices eliminates the fallbacks.
Measured: **155 ms → 89 ms** on GFPGAN. This is the single largest
contributor to the "GFPGAN enabled" row in the headline table.
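The index arithmetic behind the Split-to-Slice rewrite is small. The helper below is a hedged sketch (the name `split_to_slices` and its signature are assumptions, not the fork's code) that computes the `(start, end)` range each replacement Slice would cover along the split axis.

```python
def split_to_slices(axis_len, split_sizes=None, num_outputs=2):
    """Return one (start, end) pair per Split output along the split axis.

    If `split_sizes` is None the axis is divided into `num_outputs` equal
    parts, which is how Split behaves when no sizes are given.
    """
    if split_sizes is None:
        if axis_len % num_outputs:
            raise ValueError("axis length is not divisible into equal parts")
        split_sizes = [axis_len // num_outputs] * num_outputs
    if sum(split_sizes) != axis_len:
        raise ValueError("split sizes must sum to the axis length")

    ranges, start = [], 0
    for size in split_sizes:
        ranges.append((start, start + size))  # half-open range, like Slice
        start += size
    return ranges


print(split_to_slices(128))  # [(0, 64), (64, 128)]
```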
All three rewrites cache to disk with a `_coreml` suffix so the cost is
paid once per model per machine.
## 3. Pipeline overlap — `modules/processors/frame/core.py`, `face_swapper.py`
Face detection and face swap both use the Neural Engine. Running them
serially leaves the ANE idle during the detection half. The fork:
- Overlaps detection N+1 with swap N via a thread pool. Adds one frame
of latency; doubles ANE utilization.
- Skips `landmark_2d_106` when only `face_swapper` is active (landmarks
are unused unless mouth-mask or interpolation is on).
- Parallelizes landmark + recognition post-detection when both are
needed.
- Routes `det_10g` to GPU (Metal) so ANE stays free for the swap model.
The one-frame detection lag is a known trade-off — acceptable for
video frame rates where the face barely moves frame-to-frame. Flagged
as a quality risk on fast motion / scene cuts in `REVIEW_TODOS.md`.
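One way to overlap detection of frame N+1 with the swap of frame N is sketched below. It is an assumption-laden illustration, not the code in `core.py`: `detect` and `swap` are hypothetical callables, and this variant delays output by one frame so that each frame is still swapped with its own detection result.

```python
from concurrent.futures import ThreadPoolExecutor


def pipelined(frames, detect, swap):
    """Yield swap(frame, detect(frame)) while detecting the next frame.

    Detection of frame N+1 is submitted to a worker thread before the
    swap of frame N runs on the calling thread, so the two overlap.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        prev_frame, prev_future = None, None
        for frame in frames:
            future = pool.submit(detect, frame)  # starts in the background
            if prev_future is not None:
                yield swap(prev_frame, prev_future.result())
            prev_frame, prev_future = frame, future
        if prev_future is not None:
            # Flush the last frame once the input is exhausted.
            yield swap(prev_frame, prev_future.result())
```

For example, `list(pipelined([1, 2, 3], lambda f: f * 10, lambda f, faces: (f, faces)))` pairs each frame with its own detection result, in order.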
## 4. GFPGAN-specific — `modules/processors/frame/face_enhancer.py`, `_onnx_enhancer.py`
- Temporal cache: in live mode, run GFPGAN inference every Nth frame
and reuse the enhancement; interpolate the affine paste-back each
frame. Essentially free interpolation between inferences.
- Pre-computed FFHQ 512 landmark template (avoids per-frame matrix solve).
- Session created once under `create_onnx_session` with the same
`ModelFormat=MLProgram + MLComputeUnits=ALL` config as the swap
model — previously GFPGAN fell back to CPU-only.
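The temporal cache in the first bullet can be sketched as a small wrapper. The class name, the `interval` parameter, and the `enhance` callable are assumptions for illustration; the per-frame affine paste-back that the fork performs on the cached result is omitted here.

```python
class TemporalEnhanceCache:
    """Run an expensive enhancer every `interval` frames and reuse its output."""

    def __init__(self, enhance, interval=3):
        if interval < 1:
            raise ValueError("interval must be at least 1")
        self.enhance = enhance
        self.interval = interval
        self._frame_index = 0
        self._cached = None

    def __call__(self, face_crop):
        # Refresh on the first frame and then on every `interval`-th frame.
        if self._cached is None or self._frame_index % self.interval == 0:
            self._cached = self.enhance(face_crop)
        self._frame_index += 1
        return self._cached
```

With `interval=3`, the enhancer runs on frames 0, 3, 6, and so on, and the frames in between reuse the most recent result.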
## 5. Paste-back optimization — `modules/processors/frame/face_swapper.py`
`_fast_paste_back` replaces insightface's `paste_back` which operates
on the full frame:
- Computes face bbox from the affine matrix directly (no warp-and-scan
of a white mask).
- Runs erosion, blur, and blend on the face bbox only.
- GPU path (CUDA) keeps mask arithmetic on GPU end-to-end
(`torch.nn.functional.max_pool2d` / `avg_pool2d` for erode + blur).
- Writes in-place into `target_img` to avoid a full-frame copy.
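Computing the face bbox from the affine matrix amounts to mapping the four corners of the aligned crop back into frame coordinates. The sketch below assumes `M` is a 2×3 matrix mapping frame coordinates to crop coordinates; the function name and signature are hypothetical, not the fork's `_fast_paste_back`.

```python
import math


def face_bbox_from_affine(M, crop_size, frame_w, frame_h):
    """Return (x0, y0, x1, y1): the crop's footprint in the frame, clipped.

    M is a 2x3 affine matrix mapping frame coordinates to crop coordinates
    (the convention assumed in this sketch).
    """
    (a, b, tx), (c, d, ty) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine matrix is not invertible")

    xs, ys = [], []
    corners = ((0, 0), (crop_size, 0), (0, crop_size), (crop_size, crop_size))
    for u, v in corners:
        # Invert the affine map: solve M @ [x, y, 1] = [u, v] for (x, y).
        du, dv = u - tx, v - ty
        xs.append((d * du - b * dv) / det)
        ys.append((-c * du + a * dv) / det)

    x0 = max(0, math.floor(min(xs)))
    y0 = max(0, math.floor(min(ys)))
    x1 = min(frame_w, math.ceil(max(xs)))
    y1 = min(frame_h, math.ceil(max(ys)))
    return x0, y0, x1, y1
```

Erosion, blur, and blend then only need to touch `frame[y0:y1, x0:x1]` instead of the full frame.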
## 6. Platform routing — `modules/platform_info.py`
Single source of truth for OS / accelerator detection. Consumed by
capture backend selection, provider config for `face_swapper` and
`face_enhancer`, and a one-line startup banner confirming which code
path the app took.
## 7. Windows CUDA path (not exercised in M3 Max numbers)
Not contributing to the Apple Silicon table but included in the same
commit:
- CUDA graphs via `enable_cuda_graph=1` + `io_binding` with
pre-allocated `OrtValue` buffers. Replays the recorded kernel launch
sequence each frame with near-zero CPU overhead. Requires static
input shape — inswapper is always `1×3×128×128 + 1×512`.
- FP16 model auto-selected on GPUs with Tensor Cores (Turing+),
falls back to FP32 on older GPUs where FP16 can produce NaN.
- DLL discovery fix for NVIDIA CUDA/cuDNN from pip-installed `torch`
and `nvidia-*` wheels.
Enables 1080p @ 60 FPS on NVIDIA hardware (measured separately, not in
the table above).
## Reproducing the measurements
Cold run (kill all deep-live-cam processes, no active CoreML cache):
```bash
cd /path/to/Deep-Live-Cam
.venv/bin/python run.py
# Look for:
# [platform] ... -> confirms accelerator selection
# [VideoCapturer] 960x540 @ 60fps (reported=...)
# Partitions: 1 -> from CoreML EP verbose logs
```
For the CoreML partition / inswapper latency numbers specifically, see
the standalone test at `/Users/max/Development/onnxruntime_test/test_pad_reflect.py`
— runs the model with and without the graph rewrite on any installed ORT.
## Future cleanup
When ORT floor ≥ 1.26.0 lands (microsoft/onnxruntime#28073):
- Delete `_decompose_reflect_pad` in `modules/onnx_optimize.py`.
- Drop the `TODO(ort>=1.26)` markers.
- Update `requirements.txt`.
Does not change runtime performance — native MIL reflect matches the
Slice+Concat rewrite to within noise (27.2 vs 27.4 ms in the table
above). Purely a code-deletion cleanup.
+147
@@ -0,0 +1,147 @@
# Review TODOs — Apple Silicon + Windows CUDA Perf Commit
Post-merge review findings for commit `f65aeae` ("Apple Silicon + Windows CUDA
perf: 60 FPS pipeline, cross-platform routing"). Findings come from two
independent code reviews: Claude (in-tree read) and Codex (second opinion).
Severity reflects production impact, not difficulty to fix.
## Blockers
### CUDA-graph replay buffers not locked — `modules/processors/frame/face_swapper.py:232-238`
*Source: Claude + Codex (independent convergence)*
`_cuda_graph_swap_inference` mutates module-level `ort_input` / `ort_latent`
and runs `run_with_iobinding` with no lock. `multi_process_frame` runs frame
work concurrently, so two swap calls can overwrite the same bound input
buffers before `run_with_iobinding`, producing wrong-face output or
corrupted frames. Compare the DML path at `face_swapper.py:382-386` which
uses `modules.globals.dml_lock` for the same reason.
**Fix:** a dedicated `_cuda_graph_lock` around the full
update-run-get sequence inside `_cuda_graph_swap_inference`.
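A minimal sketch of the suggested fix, assuming a single shared bound buffer. `_bound_input` and `run_bound` are hypothetical stand-ins for the pre-allocated `OrtValue` and the `run_with_iobinding` call; only the lock placement is the point.

```python
import threading

_cuda_graph_lock = threading.Lock()
_bound_input = [None]  # stand-in for the pre-allocated, bound input buffer


def locked_graph_inference(run_bound, value):
    """Update the shared buffer, run, and read the result as one critical section."""
    with _cuda_graph_lock:
        _bound_input[0] = value         # update
        return run_bound(_bound_input)  # run + get, still under the lock
```

Holding the lock across all three steps matters: releasing it between the buffer update and the run would reopen the window in which another thread can overwrite the input.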
## Should-fix
### `many_faces` enhancer loop breaks after face #1 — `modules/processors/frame/face_enhancer.py:375`
*Source: Codex*
The `break` at line 375 is unconditional, so both the fresh-enhance and
cache-reuse paths exit the face loop after the first face. In live
`many_faces=True` mode, GFPGAN silently enhances only one face.
**Fix:** gate the `break` on `not modules.globals.many_faces`, and disable
the single-slot temporal cache in many-faces mode (cache would be
overwritten per face, pasting the wrong enhancement).
### `poisson_blend` operates on post-swap frame — `modules/processors/frame/face_swapper.py:437`
*Source: Claude*
`create_face_mask(target_face, temp_frame)` is called with `temp_frame`,
but `_fast_paste_back` wrote in-place into `temp_frame` a few lines earlier
(line 403). The mouth-mask path at line 414 correctly uses
`original_frame` — Poisson should do the same.
**Fix:** pass `original_frame` to `create_face_mask` on the Poisson path.
### Shape/Gather fold crashes on vector indices — `modules/onnx_optimize.py:150-152`
*Source: Codex*
`int(inits[idx_name])` assumes the Gather index is scalar. Models that
gather multiple dims at once pass a vector index — `int()` on a
multi-element numpy array raises `TypeError`, aborting the whole
optimization pass (no try/except around this section).
**Fix:** check `inits[idx_name].ndim == 0` or `.size == 1` before folding;
skip vector gathers (or fold to a vector constant initializer).
### Reflect-pad decompose silently wrong for asymmetric pads — `modules/onnx_optimize.py:253`
*Source: Codex*
Only reads `pads[2]` and `pads[3]` (H-start, W-start); ignores `pads[6]`
and `pads[7]` (H-end, W-end). Decomposition assumes start == end. Fine
for inswapper_128 (symmetric `[0,0,3,3,0,0,3,3]`) but silently produces
wrong output shape for any future asymmetric reflect pad.
**Fix:** read all four pad values and generate top/bottom/left/right
slice ranges separately. Or assert symmetry and skip otherwise.
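Reading all four spatial pads could look like the sketch below. It relies on the ONNX `pads` layout for a 4-D NCHW tensor (all begin values, then all end values); the helper name is hypothetical.

```python
def hw_reflect_pads(pads):
    """Split an ONNX 4-D `pads` list into (top, left, bottom, right).

    Layout: [N_begin, C_begin, H_begin, W_begin, N_end, C_end, H_end, W_end].
    """
    if len(pads) != 8:
        raise ValueError("expected 8 pad values for a 4-D NCHW tensor")
    n_b, c_b, top, left, n_e, c_e, bottom, right = pads
    if any((n_b, c_b, n_e, c_e)):
        raise ValueError("padding on N or C is not handled by this sketch")
    return top, left, bottom, right


print(hw_reflect_pads([0, 0, 3, 3, 0, 0, 3, 3]))  # (3, 3, 3, 3)
```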
### `FACE_DETECTION_CACHE` data race — `modules/processors/frame/face_swapper.py:476-506`
*Source: Claude*
`get_faces_optimized` reads and writes `FACE_DETECTION_CACHE` /
`LAST_DETECTION_TIME` module globals from multiple frame threads without
any lock. Benign in practice (worst case: a duplicate detection or a
stale read) but worth a lock wrap for hygiene.
**Fix:** wrap read-modify-write in `THREAD_LOCK`.
## Consider
### Split decompose misses opset-13+ input form — `modules/onnx_optimize.py:346-357`
*Source: Codex*
Only reads the legacy `split` attribute. Opset 13+ passes split sizes as
`input[1]`; those Split nodes stay on CPU. Depends on how GFPGAN was
exported — verify against `gfpgan-1024.onnx` as actually shipped.
**Fix:** additionally check `node.input[1]` against initializers for
newer opsets.
### `_preserve_emap_position` matches by shape, not name — `modules/onnx_optimize.py:408-423`
*Source: Claude*
Selects "first 512×512 initializer" as the emap. If insightface ever
adds another 512×512 initializer before emap, we'd misplace the tensor.
**Fix:** key on initializer name (insightface serializes it as `emap`
in the proto).
### One-frame detection lag + misleading comment — `modules/processors/frame/core.py:351-361`
*Source: Codex*
Pipelined detection result applied to the current frame is actually from
the previous frame. The inline comment "Get the detection result for
THIS frame" contradicts the later comment "the result will be used for
the next iteration." Documented latency-for-throughput trade, but the
first comment is wrong. Visible as a quality regression on fast motion
/ scene cuts.
**Fix:** correct the comment. Optionally add a config flag to disable
pipelining for high-motion footage.
### Monkey-patching `swapper.session.run` is fragile — `modules/processors/frame/face_swapper.py:223`
*Source: Claude*
`swapper.session.run = _graph_run` replaces the method. If insightface
rebuilds or swaps the session (e.g., on reconfigure), the patch is
silently lost and we fall back to the standard path without warning.
**Fix:** wrap the call site instead of monkey-patching the session,
or assert the patch survives at key lifecycle points.
### `_fast_paste_back` accepts unused `aimg` parameter — `modules/processors/frame/face_swapper.py:241`
*Source: Claude*
The caller at lines 401-403 allocates `_aimg_dummy` solely to satisfy the
signature; only `aimg.shape` is used.
**Fix:** signature `(target_img, bgr_fake, face_h, face_w, M)`.
### `onnxruntime.get_available_providers()` called at import time — `modules/platform_info.py:33`
*Source: Claude*
Runs before any Windows CUDA DLL path setup from `run.py` takes effect,
unless `platform_info` is imported after that setup. Verify import
order; otherwise CUDA provider may fail to register.
**Fix:** lazy-evaluate on first use rather than at module load, or
confirm `run.py` imports `platform_info` only after DLL-path shim.
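A lazy variant might look like the following sketch. `available_providers` is a hypothetical helper name; `onnxruntime.get_available_providers()` is the real call being deferred, and the `ImportError` fallback is an assumption added so the sketch runs without ONNX Runtime installed.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def available_providers():
    """Query ONNX Runtime providers on first call, then reuse the result."""
    try:
        import onnxruntime  # deferred so any DLL-path setup can run first
    except ImportError:
        return ()
    return tuple(onnxruntime.get_available_providers())
```

Callers would use `"CUDAExecutionProvider" in available_providers()` at the point of use, so the query runs only after `run.py` has finished its setup.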
---
## ORT 1.26 cleanup
When ORT floor >= 1.26.0, delete `_decompose_reflect_pad` (pass 2) in
`modules/onnx_optimize.py` — fixed upstream by
[microsoft/onnxruntime#28073](https://github.com/microsoft/onnxruntime/pull/28073).
See the `TODO(ort>=1.26)` markers in the file.