590 Commits

Author SHA1 Message Date
KRSHH 014bce0704 Delete PERFORMANCE.md
Removing Claude session summary
2026-04-23 22:12:55 +05:30
KRSHH c962399669 Delete REVIEW_TODOS.md 2026-04-23 22:11:53 +05:30
Kenneth Estanislao 2dd42dfc75 Merge pull request #1777 from maxwbuckley/coreml-scalar-gather-fix
Keep GFPGAN on ANE: widen scalar Gather indices for CoreML EP
2026-04-22 22:17:34 +08:00
Kenneth Estanislao c38d669f7c Merge pull request #1776 from maxwbuckley/paste-back-optimization
Paste-back: O(crop_area) compositing + uint8 cv2 SIMD blend
2026-04-22 22:14:45 +08:00
Max Buckley 890a6d41b6 onnx_optimize: widen scalar Gather indices for CoreML EP
ORT's CoreML EP GatherOpBuilder::IsOpSupportedImpl explicitly rejects
rank-0 (scalar) index tensors. StyleGAN-derived models (GFPGAN's 1024
variant has 16 of them, one per style-code slice) hit this in the
generator, and the resulting CPU fallbacks split the CoreML subgraph
into multiple partitions with boundary crossings on every inference.

Add a load-time ONNX rewrite that promotes each scalar index to [1] and
squeezes the added axis on the Gather output — semantically identical
but CoreML-compatible. GFPGAN now runs as a single CoreML partition with
zero CPU-fallback nodes; inference drops from ~87 ms to ~81 ms on an
M-series Mac.

The fix has been filed upstream as microsoft/onnxruntime#28180 — the
existing code comment in gather_op_builder.cc already describes this
exact workaround, it just isn't applied. Once the upstream fix ships
and the ORT floor is raised, this pass can be deleted.
2026-04-22 14:08:18 +02:00
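The equivalence the rewrite relies on can be sketched in numpy terms: a Gather with a rank-0 index is the same as a Gather with a [1]-shaped index followed by a Squeeze on the gather axis. This is only an illustration of the semantics; the actual pass edits the ONNX graph at load time, and the function names here are invented for the example.

```python
import numpy as np

def gather_scalar(data, idx, axis):
    # Rank-0 index: the gathered axis disappears from the output.
    # This is the form the CoreML EP rejects.
    return np.take(data, idx, axis=axis)

def gather_widened(data, idx, axis):
    # [1]-shaped index keeps the axis as size 1; Squeeze on the same
    # axis removes it again, so the result is bit-identical.
    out = np.take(data, np.array([idx]), axis=axis)
    return np.squeeze(out, axis=axis)
```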
Max Buckley f95a0bb7fb Make square aligned-face assumption explicit in _fast_paste_back
Addresses Sourcery feedback on PR #1776: _get_soft_alpha caches a single
NxN template keyed by N, which is correct for the inswapper model
(128x128 aligned-face space) but would silently mis-warp if a caller
ever passed a non-square aligned face. Assert the shape instead of
silently assuming it.
2026-04-22 13:40:18 +02:00
Max Buckley e957a7f4dd Move BGR→RGB after resize in preview display path
The processing thread was running cvtColor on the full-resolution 1920×1080
frame before queueing it for display. Since the display thread immediately
resizes the frame to the preview window (~5× smaller pixel count), doing
the colour conversion on the resized buffer is cheaper overall.

Processing thread now queues BGR; display thread resizes then cvtColor.
2026-04-22 13:31:11 +02:00
Kenneth Estanislao 19416cb3cb Merge pull request #1775 from maxwbuckley/unify-mac-windows
Apple Silicon + Windows CUDA perf: 4-5x FPS, wider capture, platform routing
2026-04-22 18:38:32 +08:00
Max Buckley cbf0859347 Paste-back blend: uint8 cv2 SIMD, no float32 round-trip
Both face_swapper._fast_paste_back and face_enhancer._paste_back were
doing a numpy float32 round-trip per frame: convert the target crop and
the warped face to float32, blend, clip, cast back to uint8. That's four
crop-sized allocations plus unvectorized elementwise math.

Replace with a fused uint8 blend using cv2.merge + cv2.multiply + cv2.add,
which cv2 dispatches to SIMD (NEON on Apple Silicon / AVX on x86). Stored
alpha templates switched from float32 [0, 1] to uint8 [0, 255] so no
conversion is needed per frame. CUDA paths also simplified — upload uint8
alpha (less bandwidth) and scale on device.

Micro-bench on 1000x1000 RGB crop:
  current (float32 numpy): 9.43 ms
  cv2 uint8 fused:         1.16 ms  (8.1× faster, max diff 2/255)

Visual diff is imperceptible (quantization noise in the last step).
2026-04-22 12:05:39 +02:00
Max Buckley a6c99607fc Cut paste-back from quartic to linear in face size
_fast_paste_back used to erode and Gaussian-blur the warped alpha mask in
output coordinates with kernel sizes proportional to the on-screen face
bbox. That made the per-frame cost ~O(area * k^2) — a face filling half
the frame took ~8x the compositing work of one filling a quarter, which
is why FPS fell off when leaning into the camera.

Instead, build a feathered alpha template once at aligned-face resolution
(128x128 for inswapper) and warp the soft mask per-frame. The affine
transform preserves the relative feather width, so the visual output is
equivalent; the per-frame cost is now O(crop_area) with no size-scaled
erode/blur and no size-scaled padding.

Also collapses the CPU fallback onto the same shape — it previously did
a full-frame warpAffine twice per call, which scaled with the whole
frame instead of the face crop.
2026-04-22 11:58:02 +02:00
Max Buckley 0a87d63560 Address PR #1775 review: pipelined-detection race and CUDA-graph monkey-patch
- core._run_pipe_pipeline: hand the background detector its own copy of
  the frame. The frame processors mutate in place via paste-back, which
  was racing with concurrent face detection on the same buffer.
- face_swapper._init_cuda_graph_session: replace the
  `swapper.session.run` monkey-patch with a `_CudaGraphSessionAdapter`
  that proxies every attribute to the underlying session and only
  overrides `.run()`. Guarded so repeat init does not double-wrap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 11:45:59 +02:00
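The adapter pattern can be sketched as a small proxy class; this is an illustration of the shape of the fix, with an injected replay callable standing in for the CUDA-graph path:

```python
class _CudaGraphSessionAdapter:
    """Proxy every attribute to the wrapped ORT session; override only run()."""

    def __init__(self, session, graph_run):
        self._session = session
        self._graph_run = graph_run  # CUDA-graph replay callable

    def __getattr__(self, name):
        # Only called for names not found on the adapter itself, so the
        # underlying session transparently serves everything but run().
        return getattr(self._session, name)

    def run(self, output_names, input_feed, run_options=None):
        return self._graph_run(output_names, input_feed)

def wrap_session(session, graph_run):
    # Guard so repeat init does not double-wrap the session.
    if isinstance(session, _CudaGraphSessionAdapter):
        return session
    return _CudaGraphSessionAdapter(session, graph_run)
```

Unlike monkey-patching `session.run`, the original session object is left untouched and can still be used directly.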
Max Buckley ea19030c74 Add PERFORMANCE.md and REVIEW_TODOS.md
PERFORMANCE.md documents measured gains on MacBook Pro M3 Max vs
hacksider/Deep-Live-Cam main@64d3f06:

- Face swap only:     <5 FPS  ->  >20 FPS
- Face swap + GFPGAN: <2 FPS  ->  >10 FPS
- Camera:             640x480 ->  960x540 MJPEG @ 60fps

Breaks down the contributors (camera negotiation, CoreML graph
rewrites with before/after op latencies, pipeline overlap, GFPGAN
temporal cache, paste-back optimization, platform routing, Windows
CUDA path) and how to reproduce.

REVIEW_TODOS.md captures 12 findings from two independent reviews
(Claude in-tree + Codex second opinion) grouped as Blockers /
Should-fix / Consider, each with file:line and suggested fix. The
two Blocker/Should-fix items are addressed in the preceding commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 11:08:33 +02:00
Max Buckley 4d04e830bc Fix CUDA-graph replay race + many_faces enhancer regression
Two issues surfaced in post-squash review of f65aeae:

1. CUDA-graph replay buffers were shared across threads with no lock.
   `_cuda_graph_swap_inference` mutates module-level ort_input/ort_latent
   and runs run_with_iobinding — concurrent swap calls on Windows/CUDA
   could overwrite each other's bound input buffers before replay,
   producing wrong-face output. Added `_cuda_graph_lock` around the
   full update/run/read sequence.

2. Face enhancer loop unconditionally broke after the first face, so
   `many_faces=True` silently enhanced only one face. Also, the
   single-slot temporal cache would paste the same enhancement onto
   every target if reused in many-faces mode. Gated the break on
   `not many_faces_mode` and disabled the cache path in that mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 11:08:23 +02:00
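The locking shape for issue 1 can be sketched as follows; buffer names are illustrative, and the replay callable stands in for `run_with_iobinding` on the captured graph:

```python
import threading

_cuda_graph_lock = threading.Lock()

def cuda_graph_swap(ort_input, ort_latent, frame, latent, replay):
    """Serialize the full update/replay/read sequence.

    Without the lock, a concurrent caller could overwrite the bound
    input buffers between this caller's update and its graph replay,
    producing wrong-face output.
    """
    with _cuda_graph_lock:
        ort_input[:] = frame    # update bound input buffers in place
        ort_latent[:] = latent
        return replay()         # replay the captured graph and read output
```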
Max Buckley f65aeae5db Apple Silicon + Windows CUDA perf: 60 FPS pipeline, cross-platform routing
Bundles CoreML graph rewrites, GPU-accelerated pipeline work, Windows CUDA
fixes, and Mac/Windows runtime routing into a single drop.

CoreML (Apple Silicon):
- Decompose Pad(reflect) → Slice+Concat in inswapper_128 so the model
  runs in one CoreML partition instead of 14 (TEMPORARY: fixed upstream
  in microsoft/onnxruntime#28073, drop when ORT >= 1.26.0).
- Fold Shape/Gather chains to constants in det_10g (21ms → 4ms).
- Decompose Split(axis=1) → Slice pairs in GFPGAN (155ms → 89ms).
- Route detection model to GPU so the ANE is free for the swap model.
- Centralize provider/config selection in create_onnx_session.

Pipeline (all platforms):
- Parallelize face landmark + recognition post-detection; skip landmark_2d_106
  when only face_swapper is active.
- Pipeline face detection with swap for ANE overlap.
- GPU-accelerated paste_back, MJPEG capture, zero-copy display path.
- Standalone pipeline benchmark script.

Windows / CUDA:
- CUDA graphs + FP16 model + all-GPU pipeline for 1080p 60 FPS.
- Auto-detect GPU provider and fix DLL discovery for Windows CUDA execution.

Cross-platform:
- platform_info helper for Mac/Windows runtime routing.
- GFPGAN 30 fps + MSMF camera 60 fps with adaptive pipeline tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 10:44:59 +02:00
KRSHH 64d3f06089 Delete tests directory 2026-04-19 17:36:33 +05:30
Kenneth Estanislao fceafcb234 Merge pull request #1751 from Gujiassh/fix/face-mask-none-frame-guard
fix(face-mask): guard create_face_mask against None frame
2026-04-15 14:13:18 +08:00
Kenneth Estanislao 033475b89c Update version in README from 2.1.2 to 2.1.3 2026-04-15 01:29:59 +08:00
Kenneth Estanislao 07711af712 Update contributors section in README.md 2026-04-15 01:29:44 +08:00
Kenneth Estanislao 44664d8a7f Merge pull request #1746 from maxwbuckley/apple-silicon-perf-optimizations
Apple Silicon performance: 1.5 → 10+ FPS (zero quality loss)
2026-04-15 01:25:51 +08:00
gujishh 15a3f537a4 test: cover additional invalid frame guards 2026-04-13 21:09:27 +09:00
gujishh fbcea9e135 fix(face-mask): guard create_face_mask against None frame 2026-04-12 14:19:48 +09:00
Max Buckley 646b0f816f Move hot-path imports to module scope
Address Sourcery review feedback: move face_align and get_one_face
imports from inside per-frame functions to module-level to avoid
repeated attribute lookup overhead in the processing loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 14:34:53 +02:00
Max Buckley bcdd0ce2dd Apple Silicon performance: 1.5 → 10+ FPS (zero quality loss)
Fix CoreML execution provider falling back to CPU silently, eliminate
redundant per-frame face detection, and optimize the paste-back blend
to operate on the face bounding box instead of the full frame.

All changes are quality-neutral (pixel-identical output verified) and
benefit non-Mac platforms via the shared detection and paste-back
improvements.

Changes:
- Remove unsupported CoreML options (RequireStaticShapes, MaximumCacheSize)
  that caused ORT 1.24 to silently fall back to CPUExecutionProvider
- Add _fast_paste_back(): bbox-restricted erode/blur/blend, skip dead
  fake_diff code in insightface's inswapper (computed but never used)
- process_frame() accepts optional pre-detected target_face to avoid
  redundant get_one_face() call (~30-40ms saved per frame, all platforms)
- In-memory pipeline detects face once and shares across processors
- Fix get_face_swapper() to fall back to FP16 model when FP32 absent
- Fix pre_start() to accept either model variant (was FP16-only check)
- Make tensorflow import conditional (fixes crash on macOS)
- Add missing tqdm dep, make tensorflow/pygrabber platform-conditional

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 14:28:07 +02:00
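The optional pre-detected face pass-through can be sketched like this; detection and swap are injected as callables purely to keep the example self-contained, and the real signature may differ:

```python
def process_frame(source_face, temp_frame, target_face=None,
                  get_one_face=None, swap_face=None):
    """Accept an optional pre-detected target_face so the pipeline can
    detect once and share the result across processors, skipping a
    redundant per-frame detection (~30-40 ms) when it is provided."""
    if target_face is None:
        target_face = get_one_face(temp_frame)  # fallback: detect here
    if target_face is None:
        return temp_frame                       # no face found: passthrough
    return swap_face(source_face, target_face, temp_frame)
```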
Kenneth Estanislao 8703d394d6 ONNX CUDA exhaustive convolution search + IO binding 2026-04-09 16:34:27 +08:00
Kenneth Estanislao 69e3fc5611 Rendering optimization
The PNG encode/decode alone was consuming significant CPU time per frame. This is eliminated entirely.
2026-04-09 16:25:22 +08:00
Kenneth Estanislao 2b26d5539e Suppress error message
Some people just want the OpenCV error gone. I keep telling them it only covers blurs and color conversion; the ONNX runtime is what actually runs the swap.
2026-04-09 16:04:00 +08:00
Kenneth Estanislao fea5a4c2d2 Merge pull request #1707 from rohanrathi99/main
Switch to FP32 model by default, add run script
2026-04-05 23:19:17 +08:00
Kenneth Estanislao 51fb7a6ad6 Merge pull request #1722 from mvanhorn/osc/1654-face-enhancer-v2
fix(face-enhancer): add missing process_frame_v2 method
2026-04-05 23:16:52 +08:00
Kenneth Estanislao 6da4f398d5 Merge pull request #1731 from JiayuuWang/contribot/fix-readme-macos-python-version
docs: fix inconsistent Python version references in macOS/Linux setup (fixes #1632)
2026-04-05 23:16:20 +08:00
Kenneth Estanislao 3e362383d8 Merge pull request #1732 from yetval/fix/cuda-vram-exhaustion-video-processing
Fix CUDA VRAM exhaustion during video processing
2026-04-05 23:15:38 +08:00
yetval 11fb5bfbc6 Fix CUDA VRAM exhaustion during video processing (#1721) 2026-04-02 22:59:41 -04:00
jacob-wang 586d8f3fb0 docs: fix inconsistent Python version references in macOS/Linux setup
The macOS Apple Silicon section installed Python 3.11 but then
referenced Python 3.10 in several places:

- `brew install python-tk@3.10` → python-tk@3.11
- Linux comment "Ensure you use the installed Python 3.10" → 3.11
- CoreML section cross-reference "completed the macOS setup above
  using Python 3.10" → 3.11
- `python3.10 run.py` usage command → python3.11
- "You must use Python 3.10" note → 3.11
- `brew reinstall python-tk@3.10` troubleshooting tip → 3.11
- Removed `python@3.11` from the list of conflicting versions to
  uninstall (it is the required version, not a conflict)

Fixes #1632
2026-04-03 10:33:11 +08:00
Kenneth Estanislao 1edc4bc298 DML lock fixed for CUDA and CPU 2026-04-01 23:56:01 +08:00
ozp3 1f3668f7c1 Delete DeepLiveCam.lnk
remove lnk and bat files as requested
2026-04-01 23:56:01 +08:00
ozp3 3d16ee346f Delete run-dml.bat
remove lnk and bat files as requested
2026-04-01 23:56:01 +08:00
ozp3 ab834d5640 feat: AMD DML optimization - GPU face detection, detection throttle, pre-load fix 2026-04-01 23:56:01 +08:00
Kenneth Estanislao bf8a89d20a Merge pull request #1725 from jhihweijhan/fix/video-output-pipeline
Fix missing video output reporting and encoding flow
2026-04-01 23:14:22 +08:00
Kenneth Estanislao bb4ef4a133 Apply suggestion from @sourcery-ai[bot]
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
2026-04-01 23:13:59 +08:00
Kenneth Estanislao b6b6c741a2 Revert "Merge pull request #1710 from ozp3/amd-dml-optimization"
This reverts commit 1b240a45fd, reversing
changes made to d9a5500bdf.
2026-04-01 22:33:01 +08:00
Kenneth Estanislao 1b240a45fd Merge pull request #1710 from ozp3/amd-dml-optimization
AMD GPU (DirectML) Optimization for Live Mode
2026-04-01 22:29:43 +08:00
ozp3 ecf02d0640 Delete DeepLiveCam.lnk
remove lnk and bat files as requested
2026-04-01 16:46:28 +03:00
ozp3 0cbc9f126f Delete run-dml.bat
remove lnk and bat files as requested
2026-04-01 16:45:31 +03:00
Karl a3fd56a312 Fix missing video output reporting and encoding flow 2026-04-01 15:22:09 +08:00
Matt Van Horn 9525d45291 fix(face-enhancer): add missing process_frame_v2 method
The live webcam preview in ui.py calls process_frame_v2() on all
frame processors, but face_enhancer.py was missing this method.
This caused an AttributeError crash when the GFPGAN face enhancer
was enabled during live mode.

Fixes https://github.com/hacksider/Deep-Live-Cam/issues/1654

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 23:49:12 -07:00
Kenneth Estanislao d9a5500bdf Merge pull request #1713 from TeachDian/fix-1705-wsl-onnxruntime-gpu 2026-03-29 04:54:34 +08:00
TeachDian 86134b6e1d Fix #1705: Update onnxruntime-gpu requirement to 1.23.2 for WSL compatibility 2026-03-29 04:46:48 +08:00
ozp3 fbd1cc5973 docs: add AMD DML optimization notes to README 2026-03-28 13:16:43 +03:00
ozp3 eac2ad2307 feat: AMD DML optimization - GPU face detection, detection throttle, pre-load fix 2026-03-28 13:09:20 +03:00
Kenneth Estanislao 9e6f30c0a4 Silenced deprecation warning 2026-03-27 21:35:27 +08:00
Kenneth Estanislao 97321a740d Update face_analyser.py
320 was over-optimized; put back to 640
2026-03-27 21:24:19 +08:00