From ea19030c7454f43e4e11489d31ca2b35d7b4ee31 Mon Sep 17 00:00:00 2001
From: Max Buckley <maxwbuckley@gmail.com>
Date: Wed, 22 Apr 2026 11:08:33 +0200
Subject: [PATCH] Add PERFORMANCE.md and REVIEW_TODOS.md

PERFORMANCE.md documents measured gains on MacBook Pro M3 Max vs
hacksider/Deep-Live-Cam main@64d3f06:

- Face swap only:     <5 FPS  ->  >20 FPS
- Face swap + GFPGAN: <2 FPS  ->  >10 FPS
- Camera:             640x480 ->  960x540 MJPEG @ 60fps

Breaks down the contributors (camera negotiation, CoreML graph
rewrites with before/after op latencies, pipeline overlap, GFPGAN
temporal cache, paste-back optimization, platform routing, Windows
CUDA path) and how to reproduce.

REVIEW_TODOS.md captures 12 findings from two independent reviews
(Claude in-tree + Codex second opinion) grouped as Blockers /
Should-fix / Consider, each with file:line and suggested fix. The
two Blocker/Should-fix items are addressed in the preceding commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
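Reviewer note: the empirical FPS measurement described in PERFORMANCE.md
section 1 (time 30 frames after 10 warmup reads instead of trusting
`CAP_PROP_FPS`) can be sketched roughly as below. This is an illustrative
sketch only, not the code in `modules/video_capture.py`. The function name
`measure_fps`, its parameters, the injectable `clock`, and the `read_frame`
callable are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


def measure_fps(
    read_frame: Callable[[], bool],
    warmup: int = 10,
    samples: int = 30,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Estimate delivered FPS by timing `samples` reads after `warmup` reads.

    `read_frame` is a hypothetical callable that returns True on a
    successful grab, e.g. `lambda: cap.read()[0]` for a cv2.VideoCapture
    named `cap`. Returns None if any read fails or no time elapsed.
    """
    # Discard the first reads so startup effects do not skew the timing.
    for _ in range(warmup):
        if not read_frame():
            return None

    # Time a fixed number of blocking reads; the camera paces the loop.
    start = clock()
    for _ in range(samples):
        if not read_frame():
            return None
    elapsed = clock() - start

    if elapsed <= 0:
        return None
    return samples / elapsed
```

With the defaults above, a camera that really delivers 60 fps spends about
40 reads / 60 fps, roughly 0.7 s, in this function, which is consistent with
the ~0.5–1 s startup cost quoted in PERFORMANCE.md.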
 PERFORMANCE.md  | 171 ++++++++++++++++++++++++++++++++++++++++++++++++
 REVIEW_TODOS.md | 147 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 318 insertions(+)
 create mode 100644 PERFORMANCE.md
 create mode 100644 REVIEW_TODOS.md

diff --git a/PERFORMANCE.md b/PERFORMANCE.md
new file mode 100644
index 0000000..b10cdb8
--- /dev/null
+++ b/PERFORMANCE.md
@@ -0,0 +1,171 @@
+# Apple Silicon + Cross-Platform Performance
+
+End-to-end measurements from commit `f65aeae` on a MacBook Pro M3 Max
+against `hacksider/Deep-Live-Cam` upstream `main@64d3f06`, same hardware,
+same camera, same source/target faces.
+
+| Mode | Upstream `main` | This fork | Delta |
+|---|---|---|---|
+| Face swap only | **<5 FPS** | **>20 FPS** | >4× |
+| Face swap + GFPGAN enhancer | **<2 FPS** | **>10 FPS** | >5× |
+| Camera resolution | 640×480 default | **960×540 MJPEG** | wider FoV |
+| Camera frame rate | 15–30 fps (backend default) | **60 fps negotiated + measured** | up to 2–4× |
+
+The gap is cumulative — no single change accounts for it. Each section
+below describes one contributor, in rough order of impact on the
+live-video pipeline.
+
+## 1. Camera capture negotiation — `modules/video_capture.py`
+
+Upstream calls `cv2.VideoCapture(device_index)` with no hints and
+accepts whatever the camera defaults to. On most webcams that means
+640×480 YUV at 15–30 fps, and `CAP_PROP_FPS` is unreliable on DirectShow.
+
+This fork:
+
+- Requests `MJPG` fourcc to bypass USB bandwidth limits on uncompressed
+  YUV at high resolutions.
+- Requests 960×540 @ 60 fps up front.
+- Reads back the camera's actual resolution via `CAP_PROP_FRAME_WIDTH/HEIGHT`.
+- **Empirically measures FPS** by timing 30 frames after 10 warmup
+  reads (`_measure_fps`) instead of trusting `CAP_PROP_FPS`. Costs
+  ~0.5–1 s at startup; gives ground-truth numbers that downstream
+  code (detection cadence, enhancer throttle) tunes against.
+- Windows path tries `CAP_MSMF` before `CAP_DSHOW` (DirectShow often
+  caps at 30 fps even when the camera supports 60).
+
+This single change is why the resolution / FoV / FPS look different between
+upstream and the fork before any ML work starts.
+
+## 2. CoreML graph rewrites — `modules/onnx_optimize.py`
+
+CoreML EP silently falls back to CPU for ops it doesn't support,
+creating partition boundaries with a CPU↔ANE round-trip at each one.
+Three pre-load rewrites eliminate the fallbacks:
+
+### 2a. `Pad(mode=reflect)` → `Slice + Concat` (inswapper_128)
+
+Verified on this machine:
+
+| Config | Partitions | inswapper latency |
+|---|---|---|
+| Original ONNX, ORT 1.24.4 | **14** | 55.3 ms |
+| Rewritten, ORT 1.24.4 (this pass) | **1** | 27.4 ms |
+| Original ONNX, ORT 1.26 (main) | **1** | 27.2 ms |
+
+The third row uses ORT built from `main` at `fb13eb3edd` which contains
+[microsoft/onnxruntime#28073](https://github.com/microsoft/onnxruntime/pull/28073)
+— native MIL `pad(mode="reflect")` under `MLProgram`. Once ORT ≥ 1.26 is the
+floor, this pass can be deleted. See `REVIEW_TODOS.md` for the cleanup note.
+
+### 2b. `Shape → Gather` folded to constants (det_10g)
+
+Dynamic shape chains for FPN upsample target sizes forced parts of the
+face detector onto CPU. When the input shape is known at load time we
+run ONNX shape inference and replace the chains with `int64` constants.
+
+Measured: **21 ms → 4 ms** on the detection model.
+
+### 2c. `Split(axis=1, 2 outputs)` → `Slice` pairs (GFPGAN)
+
+CoreML EP doesn't support `Split`. GFPGAN's SFT modulation layers use
+channel-wise splits everywhere, forcing partition boundaries. Rewriting
+each 2-way Split as two Slices eliminates the fallbacks.
+
+Measured: **155 ms → 89 ms** on GFPGAN. This is the single largest
+contributor to the "Face swap + GFPGAN enhancer" row in the headline table.
+
+All three rewrites cache to disk with a `_coreml` suffix so the cost is
+paid once per model per machine.
+
+## 3. Pipeline overlap — `modules/processors/frame/core.py`, `face_swapper.py`
+
+Face detection and face swap both use the Neural Engine. Running them
+serially leaves the ANE idle during the detection half. The fork:
+
+- Overlaps detection N+1 with swap N via a thread pool. Adds one frame
+  of latency; doubles ANE utilization.
+- Skips `landmark_2d_106` when only `face_swapper` is active (landmarks
+  are unused unless mouth-mask or interpolation is on).
+- Parallelizes landmark + recognition post-detection when both are
+  needed.
+- Routes `det_10g` to GPU (Metal) so ANE stays free for the swap model.
+
+The one-frame detection lag is a known trade-off — acceptable for
+video frame rates where the face barely moves frame-to-frame. Flagged
+as a quality risk on fast motion / scene cuts in `REVIEW_TODOS.md`.
+
+## 4. GFPGAN-specific — `modules/processors/frame/face_enhancer.py`, `_onnx_enhancer.py`
+
+- Temporal cache: in live mode, run GFPGAN inference every Nth frame
+  and reuse the enhancement; interpolate the affine paste-back each
+  frame. Essentially free interpolation between inferences.
+- Pre-computed FFHQ 512 landmark template (avoids per-frame matrix solve).
+- Session created once under `create_onnx_session` with the same
+  `ModelFormat=MLProgram + MLComputeUnits=ALL` config as the swap
+  model — previously GFPGAN fell back to CPU-only.
+
+## 5. Paste-back optimization — `modules/processors/frame/face_swapper.py`
+
+`_fast_paste_back` replaces insightface's `paste_back` which operates
+on the full frame:
+
+- Computes face bbox from the affine matrix directly (no warp-and-scan
+  of a white mask).
+- Runs erosion, blur, and blend on the face bbox only.
+- GPU path (CUDA) keeps mask arithmetic on GPU end-to-end
+  (`torch.nn.functional.max_pool2d` / `avg_pool2d` for erode + blur).
+- Writes in-place into `target_img` to avoid a full-frame copy.
+
+## 6. Platform routing — `modules/platform_info.py`
+
+Single source of truth for OS / accelerator detection. Consumed by
+capture backend selection, provider config for `face_swapper` and
+`face_enhancer`, and a one-line startup banner confirming which code
+path the app took.
+
+## 7. Windows CUDA path (not exercised in M3 Max numbers)
+
+Not contributing to the Apple Silicon table but included in the same
+commit:
+
+- CUDA graphs via `enable_cuda_graph=1` + `io_binding` with
+  pre-allocated `OrtValue` buffers. Replays the recorded kernel launch
+  sequence each frame with near-zero CPU overhead. Requires static
+  input shape — inswapper is always `1×3×128×128 + 1×512`.
+- FP16 model auto-selected on GPUs with Tensor Cores (Turing+),
+  falls back to FP32 on older GPUs where FP16 can produce NaN.
+- DLL discovery fix for NVIDIA CUDA/cuDNN from pip-installed `torch`
+  and `nvidia-*` wheels.
+
+Enables 1080p @ 60 FPS on NVIDIA hardware (measured separately, not in
+the table above).
+
+## Reproducing the measurements
+
+Cold run (kill all deep-live-cam processes, no active CoreML cache):
+
+```bash
+cd /path/to/Deep-Live-Cam
+.venv/bin/python run.py
+# Look for:
+#   [platform] ...                 -> confirms accelerator selection
+#   [VideoCapturer] 960x540 @ 60fps (reported=...)
+#   Partitions: 1                  -> from CoreML EP verbose logs
+```
+
+For the CoreML partition / inswapper latency numbers specifically, see
+the standalone test at `/Users/max/Development/onnxruntime_test/test_pad_reflect.py`
+(a path on the author's machine). It runs the model with and without the graph rewrite on any installed ORT.
+
+## Future cleanup
+
+Once the ORT floor is raised to ≥ 1.26.0 (which includes microsoft/onnxruntime#28073):
+
+- Delete `_decompose_reflect_pad` in `modules/onnx_optimize.py`.
+- Drop the `TODO(ort>=1.26)` markers.
+- Update `requirements.txt`.
+
+This does not change runtime performance: native MIL reflect matches the
+Slice+Concat rewrite to within noise (27.2 vs 27.4 ms in the table
+in section 2a). It is purely a code-deletion cleanup.
diff --git a/REVIEW_TODOS.md b/REVIEW_TODOS.md
new file mode 100644
index 0000000..d4ee738
--- /dev/null
+++ b/REVIEW_TODOS.md
@@ -0,0 +1,147 @@
+# Review TODOs — Apple Silicon + Windows CUDA Perf Commit
+
+Post-merge review findings for commit `f65aeae` ("Apple Silicon + Windows CUDA
+perf: 60 FPS pipeline, cross-platform routing"). Findings come from two
+independent code reviews: Claude (in-tree read) and Codex (second opinion).
+Severity reflects production impact, not difficulty to fix.
+
+## Blockers
+
+### CUDA-graph replay buffers not locked — `modules/processors/frame/face_swapper.py:232-238`
+*Source: Claude + Codex (independent convergence)*
+
+`_cuda_graph_swap_inference` mutates module-level `ort_input` / `ort_latent`
+and runs `run_with_iobinding` with no lock. `multi_process_frame` runs frame
+work concurrently, so two swap calls can overwrite the same bound input
+buffers before `run_with_iobinding`, producing wrong-face output or
+corrupted frames. Compare the DML path at `face_swapper.py:382-386` which
+uses `modules.globals.dml_lock` for the same reason.
+
+**Fix:** a dedicated `_cuda_graph_lock` around the full
+update-run-get sequence inside `_cuda_graph_swap_inference`.
+
+## Should-fix
+
+### `many_faces` enhancer loop breaks after face #1 — `modules/processors/frame/face_enhancer.py:375`
+*Source: Codex*
+
+The `break` at line 375 is unconditional, so both the fresh-enhance and
+cache-reuse paths exit the face loop after the first face. In live
+`many_faces=True` mode, GFPGAN silently enhances only one face.
+
+**Fix:** gate the `break` on `not modules.globals.many_faces`, and disable
+the single-slot temporal cache in many-faces mode (cache would be
+overwritten per face, pasting the wrong enhancement).
+
+### `poisson_blend` operates on post-swap frame — `modules/processors/frame/face_swapper.py:437`
+*Source: Claude*
+
+`create_face_mask(target_face, temp_frame)` is called with `temp_frame`,
+but `_fast_paste_back` wrote in-place into `temp_frame` a few lines earlier
+(line 403). The mouth-mask path at line 414 correctly uses
+`original_frame` — Poisson should do the same.
+
+**Fix:** pass `original_frame` to `create_face_mask` on the Poisson path.
+
+### Shape/Gather fold crashes on vector indices — `modules/onnx_optimize.py:150-152`
+*Source: Codex*
+
+`int(inits[idx_name])` assumes the Gather index is scalar. Models that
+gather multiple dims at once pass a vector index — `int()` on a
+multi-element numpy array raises `TypeError`, aborting the whole
+optimization pass (no try/except around this section).
+
+**Fix:** check `inits[idx_name].ndim == 0` or `.size == 1` before folding;
+skip vector gathers (or fold to a vector constant initializer).
+
+### Reflect-pad decompose silently wrong for asymmetric pads — `modules/onnx_optimize.py:253`
+*Source: Codex*
+
+Only reads `pads[2]` and `pads[3]` (H-start, W-start); ignores `pads[6]`
+and `pads[7]` (H-end, W-end). Decomposition assumes start == end. Fine
+for inswapper_128 (symmetric `[0,0,3,3,0,0,3,3]`) but silently produces
+wrong output shape for any future asymmetric reflect pad.
+
+**Fix:** read all four pad values and generate top/bottom/left/right
+slice ranges separately. Or assert symmetry and skip otherwise.
+
+### `FACE_DETECTION_CACHE` data race — `modules/processors/frame/face_swapper.py:476-506`
+*Source: Claude*
+
+`get_faces_optimized` reads and writes `FACE_DETECTION_CACHE` /
+`LAST_DETECTION_TIME` module globals from multiple frame threads without
+any lock. Benign in practice (worst case: a duplicate detection or a
+stale read) but worth a lock wrap for hygiene.
+
+**Fix:** wrap read-modify-write in `THREAD_LOCK`.
+
+## Consider
+
+### Split decompose misses opset-13+ input form — `modules/onnx_optimize.py:346-357`
+*Source: Codex*
+
+Only reads the legacy `split` attribute. Opset 13+ passes split sizes as
+`input[1]`; those Split nodes stay on CPU. Depends on how GFPGAN was
+exported — verify against `gfpgan-1024.onnx` as actually shipped.
+
+**Fix:** additionally check `node.input[1]` against initializers for
+newer opsets.
+
+### `_preserve_emap_position` matches by shape, not name — `modules/onnx_optimize.py:408-423`
+*Source: Claude*
+
+Selects "first 512×512 initializer" as the emap. If insightface ever
+adds another 512×512 initializer before emap, we'd misplace the tensor.
+
+**Fix:** key on initializer name (insightface serializes it as `emap`
+in the proto).
+
+### One-frame detection lag + misleading comment — `modules/processors/frame/core.py:351-361`
+*Source: Codex*
+
+Pipelined detection result applied to the current frame is actually from
+the previous frame. The inline comment "Get the detection result for
+THIS frame" contradicts the later comment "the result will be used for
+the next iteration." Documented latency-for-throughput trade, but the
+first comment is wrong. Visible as a quality regression on fast motion
+/ scene cuts.
+
+**Fix:** correct the comment. Optionally add a config flag to disable
+pipelining for high-motion footage.
+
+### Monkey-patching `swapper.session.run` is fragile — `modules/processors/frame/face_swapper.py:223`
+*Source: Claude*
+
+`swapper.session.run = _graph_run` replaces the method. If insightface
+rebuilds or swaps the session (e.g., on reconfigure), the patch is
+silently lost and we fall back to the standard path without warning.
+
+**Fix:** wrap the call site instead of monkey-patching the session,
+or assert the patch survives at key lifecycle points.
+
+### `_fast_paste_back` accepts unused `aimg` parameter — `modules/processors/frame/face_swapper.py:241`
+*Source: Claude*
+
+The caller at lines 401-403 allocates a `_aimg_dummy` solely to satisfy the
+signature. Only `aimg.shape` is used.
+
+**Fix:** signature `(target_img, bgr_fake, face_h, face_w, M)`.
+
+### `onnxruntime.get_available_providers()` called at import time — `modules/platform_info.py:33`
+*Source: Claude*
+
+Runs before any Windows CUDA DLL path setup from `run.py` takes effect,
+unless `platform_info` is imported after that setup. Verify import
+order; otherwise CUDA provider may fail to register.
+
+**Fix:** lazy-evaluate on first use rather than at module load, or
+confirm `run.py` imports `platform_info` only after DLL-path shim.
+
+---
+
+## ORT 1.26 cleanup
+
+When ORT floor >= 1.26.0, delete `_decompose_reflect_pad` (pass 2) in
+`modules/onnx_optimize.py` — fixed upstream by
+[microsoft/onnxruntime#28073](https://github.com/microsoft/onnxruntime/pull/28073).
+See the `TODO(ort>=1.26)` markers in the file.