ORT's CoreML EP GatherOpBuilder::IsOpSupportedImpl explicitly rejects
rank-0 (scalar) index tensors. StyleGAN-derived models (GFPGAN's 1024
variant has 16 of them, one per style-code slice) hit this in the
generator, and the resulting CPU fallbacks split the CoreML subgraph
into multiple partitions with boundary crossings on every inference.
Add a load-time ONNX rewrite that promotes each scalar index to [1] and
squeezes the added axis on the Gather output — semantically identical
but CoreML-compatible. GFPGAN now runs as a single CoreML partition with
zero CPU-fallback nodes; inference drops from ~87 ms to ~81 ms on an
M-series Mac.
The fix has been filed upstream as microsoft/onnxruntime#28180 — the
existing code comment in gather_op_builder.cc already describes this
exact workaround; it just isn't applied. Once the upstream fix ships
and the ORT floor is raised, this pass can be deleted.
Addresses Sourcery feedback on PR #1776: _get_soft_alpha caches a single
NxN template keyed by N, which is correct for the inswapper model
(128x128 aligned-face space) but would mis-warp without warning if a
caller ever passed a non-square aligned face. Assert squareness instead
of silently assuming it.
The processing thread was running cvtColor on the full-resolution 1920×1080
frame before queueing it for display. Since the display thread immediately
resizes the frame to the preview window (~5× smaller pixel count), doing
the colour conversion on the resized buffer is cheaper overall.
Processing thread now queues BGR; display thread resizes then cvtColor.
Both face_swapper._fast_paste_back and face_enhancer._paste_back were
doing a numpy float32 round-trip per frame: convert the target crop and
the warped face to float32, blend, clip, cast back to uint8. That's four
crop-sized allocations plus unvectorized elementwise math.
Replace with a fused uint8 blend using cv2.merge + cv2.multiply + cv2.add,
which cv2 dispatches to SIMD (NEON on Apple Silicon / AVX on x86). Stored
alpha templates switched from float32 [0, 1] to uint8 [0, 255] so no
conversion is needed per frame. CUDA paths also simplified — upload uint8
alpha (less bandwidth) and scale on device.
Micro-bench on a 1000x1000 RGB crop:
- current (float32 numpy): 9.43 ms
- cv2 uint8 fused: 1.16 ms (8.1× faster, max diff 2/255)
Visual diff is imperceptible (quantization noise in the last step).
_fast_paste_back used to erode and Gaussian-blur the warped alpha mask in
output coordinates with kernel sizes proportional to the on-screen face
bbox. That made the per-frame cost ~O(area * k^2) — a face filling half
the frame took ~8x the compositing work of one filling a quarter, which
is why FPS fell off when leaning into the camera.
Instead, build a feathered alpha template once at aligned-face resolution
(128x128 for inswapper) and warp the soft mask per-frame. The affine
transform preserves the relative feather width, so the visual output is
equivalent; the per-frame cost is now O(crop_area) with no size-scaled
erode/blur and no size-scaled padding.
Also collapses the CPU fallback onto the same shape — it previously did
a full-frame warpAffine twice per call, which scaled with the whole
frame instead of the face crop.
- core._run_pipe_pipeline: hand the background detector its own copy of
the frame. The frame processors mutate in place via paste-back, which
was racing with concurrent face detection on the same buffer.
- face_swapper._init_cuda_graph_session: replace the
`swapper.session.run` monkey-patch with a `_CudaGraphSessionAdapter`
that proxies every attribute to the underlying session and only
overrides `.run()`. Guarded so repeat init does not double-wrap.
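The adapter pattern can be sketched like this (only the class name comes from the commit; the replay callable here is a stand-in for the real CUDA-graph path):

```python
class _CudaGraphSessionAdapter:
    """Proxy that forwards every attribute to the wrapped session and
    overrides only .run()."""

    def __init__(self, session, replay_fn):
        self._session = session
        self._replay_fn = replay_fn

    def __getattr__(self, name):
        # Invoked only for names NOT defined on the adapter itself, so
        # get_inputs(), get_providers(), io_binding() etc. pass through.
        return getattr(self._session, name)

    def run(self, output_names, input_feed, run_options=None):
        return self._replay_fn(output_names, input_feed)

def wrap_session(session, replay_fn):
    """Guard so repeat init does not double-wrap the session."""
    if isinstance(session, _CudaGraphSessionAdapter):
        return session
    return _CudaGraphSessionAdapter(session, replay_fn)
```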
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PERFORMANCE.md documents measured gains on MacBook Pro M3 Max vs
hacksider/Deep-Live-Cam main@64d3f06:
- Face swap only: <5 FPS -> >20 FPS
- Face swap + GFPGAN: <2 FPS -> >10 FPS
- Camera: 640x480 -> 960x540 MJPEG @ 60fps
Breaks down the contributors (camera negotiation, CoreML graph
rewrites with before/after op latencies, pipeline overlap, GFPGAN
temporal cache, paste-back optimization, platform routing, Windows
CUDA path) and how to reproduce.
REVIEW_TODOS.md captures 12 findings from two independent reviews
(Claude in-tree + Codex second opinion) grouped as Blockers /
Should-fix / Consider, each with file:line and suggested fix. The
two Blocker/Should-fix items are addressed in the preceding commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues surfaced in post-squash review of f65aeae:
1. CUDA-graph replay buffers were shared across threads with no lock.
`_cuda_graph_swap_inference` mutates module-level ort_input/ort_latent
and runs run_with_iobinding — concurrent swap calls on Windows/CUDA
could overwrite each other's bound input buffers before replay,
producing wrong-face output. Added `_cuda_graph_lock` around the
full update/run/read sequence.
2. Face enhancer loop unconditionally broke after the first face, so
`many_faces=True` silently enhanced only one face. Also, the
single-slot temporal cache would paste the same enhancement onto
every target if reused in many-faces mode. Gated the break on
`not many_faces_mode` and disabled the cache path in that mode.
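The locking shape of fix (1), sketched with numpy buffers standing in for the bound ORT tensors (buffer names and the replay stand-in are illustrative, not real ORT calls):

```python
import threading
import numpy as np

# Module-level replay buffers, mirroring the commit's ort_input/ort_latent.
_ort_input = np.zeros(4, np.float32)
_ort_output = np.zeros(4, np.float32)
_cuda_graph_lock = threading.Lock()

def _replay() -> None:
    # Stand-in for session.run_with_iobinding(): reads the bound input
    # buffer and writes the bound output buffer.
    np.copyto(_ort_output, _ort_input * 2.0)

def swap_inference(frame: np.ndarray) -> np.ndarray:
    # The lock must span the FULL update -> replay -> read sequence;
    # locking only the buffer update would still let another thread
    # clobber _ort_input between our update and the replay.
    with _cuda_graph_lock:
        np.copyto(_ort_input, frame)
        _replay()
        return _ort_output.copy()
```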
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles CoreML graph rewrites, GPU-accelerated pipeline work, Windows CUDA
fixes, and Mac/Windows runtime routing into a single drop.
CoreML (Apple Silicon):
- Decompose Pad(reflect) → Slice+Concat in inswapper_128 so the model
runs in one CoreML partition instead of 14 (TEMPORARY: fixed upstream
in microsoft/onnxruntime#28073, drop when ORT >= 1.26.0).
- Fold Shape/Gather chains to constants in det_10g (21ms → 4ms).
- Decompose Split(axis=1) → Slice pairs in GFPGAN (155ms → 89ms).
- Route detection model to GPU so the ANE is free for the swap model.
- Centralize provider/config selection in create_onnx_session.
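The Pad(reflect) decomposition in the first bullet amounts to mirrored slices concatenated around the tensor; a numpy sketch of the equivalence (in the ONNX graph the mirrored slices become Slice ops with negative steps feeding a Concat):

```python
import numpy as np

def reflect_pad_via_slice_concat(x: np.ndarray, p: int) -> np.ndarray:
    """Reflect-pad the last two axes by p using only slices and concats,
    which the CoreML EP can keep inside one partition."""
    # Height axis: rows 1..p mirrored on top, rows -p-1..-2 on the bottom
    # (reflect padding excludes the edge row itself).
    x = np.concatenate([x[..., p:0:-1, :], x, x[..., -2:-p - 2:-1, :]],
                       axis=-2)
    # Same trick on the width axis.
    x = np.concatenate([x[..., p:0:-1], x, x[..., -2:-p - 2:-1]], axis=-1)
    return x
```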
Pipeline (all platforms):
- Parallelize face landmark + recognition post-detection; skip landmark_2d_106
when only face_swapper is active.
- Pipeline face detection with swap for ANE overlap.
- GPU-accelerated paste_back, MJPEG capture, zero-copy display path.
- Standalone pipeline benchmark script.
Windows / CUDA:
- CUDA graphs + FP16 model + all-GPU pipeline for 1080p 60 FPS.
- Auto-detect GPU provider and fix DLL discovery for Windows CUDA execution.
Cross-platform:
- platform_info helper for Mac/Windows runtime routing.
- GFPGAN 30 fps + MSMF camera 60 fps with adaptive pipeline tuning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Sourcery review feedback: move face_align and get_one_face
imports from inside per-frame functions to module level, so the import
statement (a sys.modules lookup and name rebind) is not re-executed on
every frame in the processing loop.
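The difference, sketched with a stdlib function standing in for get_one_face / face_align:

```python
def detect_with_inner_import(frame):
    # Anti-pattern: the import statement re-executes (sys.modules lookup
    # plus a local rebind) on every call in the hot loop.
    from math import hypot  # stands in for get_one_face
    return hypot(*frame)

from math import hypot  # hoisted: resolved once at module load

def detect_hoisted(frame):
    return hypot(*frame)
```

Both return the same result; the hoisted version just skips the per-call import machinery.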
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix CoreML execution provider falling back to CPU silently, eliminate
redundant per-frame face detection, and optimize the paste-back blend
to operate on the face bounding box instead of the full frame.
All changes are quality-neutral (pixel-identical output verified) and
benefit non-Mac platforms via the shared detection and paste-back
improvements.
Changes:
- Remove unsupported CoreML options (RequireStaticShapes, MaximumCacheSize)
that caused ORT 1.24 to silently fall back to CPUExecutionProvider
- Add _fast_paste_back(): bbox-restricted erode/blur/blend, skip dead
fake_diff code in insightface's inswapper (computed but never used)
- process_frame() accepts optional pre-detected target_face to avoid
redundant get_one_face() call (~30-40ms saved per frame, all platforms)
- In-memory pipeline detects face once and shares across processors
- Fix get_face_swapper() to fall back to FP16 model when FP32 absent
- Fix pre_start() to accept either model variant (was FP16-only check)
- Make tensorflow import conditional (fixes crash on macOS)
- Add missing tqdm dep, make tensorflow/pygrabber platform-conditional
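The FP32 -> FP16 model fallback from the bullets above can be sketched as (helper name and directory layout are assumptions; the real logic lives in get_face_swapper):

```python
import os

def resolve_swapper_model(models_dir: str) -> str:
    """Prefer the FP32 inswapper model; fall back to the FP16 variant
    when the FP32 file is absent."""
    for name in ("inswapper_128.onnx", "inswapper_128_fp16.onnx"):
        path = os.path.join(models_dir, name)
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"no inswapper model found in {models_dir}")
```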
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Some people just want the OpenCV error gone. I keep telling them that
OpenCV is only used for blurs and colour conversion; it is ONNX Runtime
that runs the swap.
The macOS Apple Silicon section installed Python 3.11 but then
referenced Python 3.10 in several places:
- `brew install python-tk@3.10` → python-tk@3.11
- Linux comment "Ensure you use the installed Python 3.10" → 3.11
- CoreML section cross-reference "completed the macOS setup above
using Python 3.10" → 3.11
- `python3.10 run.py` usage command → python3.11
- "You must use Python 3.10" note → 3.11
- `brew reinstall python-tk@3.10` troubleshooting tip → 3.11
- Removed `python@3.11` from the list of conflicting versions to
uninstall (it is the required version, not a conflict)
Fixes #1632
The live webcam preview in ui.py calls process_frame_v2() on all
frame processors, but face_enhancer.py was missing this method.
This caused an AttributeError crash when the GFPGAN face enhancer
was enabled during live mode.
Fixes https://github.com/hacksider/Deep-Live-Cam/issues/1654
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>