hacksider-Deep-Live-Cam

mirror of https://github.com/hacksider/Deep-Live-Cam.git synced 2026-07-13 21:47:22 +02:00

Author	SHA1	Message	Date
Kenneth Estanislao	834bc43768	Support non-ascii characters	2026-06-14 20:18:56 +08:00
Max Buckley	cfa8123b67	Add ruff CI gate and fix deterministic lint issues Introduces pyproject.toml + .github/workflows/ruff.yml that gate E701, E711, E712, F401, F541 on every PR and push to main. Fixes the existing findings for those rules: - Remove unused imports (sklearn.silhouette_score, numpy in several files, typing.Optional, get_one_face, gpu_cvt_color, sys, insightface.face_align) - Annotate the intentional tkinter_fix side-effect import with `# noqa: F401` - Split multi-statement `if x: y` one-liners onto separate lines - Replace `state == True` / `state == False` with truthiness checks - Drop `f` prefix from f-strings with no placeholders F841 (unused-variable), E402 (module-level-import-not-at-top), and F821 (undefined-name) are left out of the gate for now — they surface real findings (including a latent NameError in face_swapper.py) that require human review to fix safely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 15:44:31 +02:00
hklcf	886e64b320	Fix: resolve 5 confirmed bugs (imwrite_unicode, macOS memory, face_analyser None crash, silent sys.exit, core memory calc)	2026-05-23 10:37:20 +08:00
Kenneth Estanislao	ca8e39e3bb	Fix mouth mask	2026-05-18 02:11:04 +08:00
Kenneth Estanislao	0e97e474e4	better swapping	2026-05-18 01:40:01 +08:00
Kenneth Estanislao	9c67a7aacc	fixed poisson blend	2026-05-18 01:36:24 +08:00
Kenneth Estanislao	9c5f01c7f1	some fix for face enhancers	2026-05-15 15:13:57 +08:00
Ihor Kuzmychov	297acded3b	fix: use BORDER_REPLICATE for face warp to eliminate black border	2026-04-23 19:42:32 +02:00
Max Buckley	f95a0bb7fb	Make square aligned-face assumption explicit in _fast_paste_back Addresses Sourcery feedback on PR #1776: _get_soft_alpha caches a single NxN template keyed by N, which is correct for the inswapper model (128x128 aligned-face space) but would silently mis-warp if a caller ever passed a non-square aligned face. Assert the shape instead of silently assuming it.	2026-04-22 13:40:18 +02:00
Max Buckley	cbf0859347	Paste-back blend: uint8 cv2 SIMD, no float32 round-trip Both face_swapper._fast_paste_back and face_enhancer._paste_back were doing a numpy float32 round-trip per frame: convert the target crop and the warped face to float32, blend, clip, cast back to uint8. That's four crop-sized allocations plus unvectorized elementwise math. Replace with a fused uint8 blend using cv2.merge + cv2.multiply + cv2.add, which cv2 dispatches to SIMD (NEON on Apple Silicon / AVX on x86). Stored alpha templates switched from float32 [0, 1] to uint8 [0, 255] so no conversion is needed per frame. CUDA paths also simplified — upload uint8 alpha (less bandwidth) and scale on device. Micro-bench on 1000x1000 RGB crop: current (float32 numpy): 9.43 ms cv2 uint8 fused: 1.16 ms (8.1× faster, max diff 2/255) Visual diff is imperceptible (quantization noise in the last step).	2026-04-22 12:05:39 +02:00
Max Buckley	a6c99607fc	Cut paste-back from quartic to linear in face size _fast_paste_back used to erode and Gaussian-blur the warped alpha mask in output coordinates with kernel sizes proportional to the on-screen face bbox. That made the per-frame cost ~O(area * k^2) — a face filling half the frame took ~8x the compositing work of one filling a quarter, which is why FPS fell off when leaning into the camera. Instead, build a feathered alpha template once at aligned-face resolution (128x128 for inswapper) and warp the soft mask per-frame. The affine transform preserves the relative feather width, so the visual output is equivalent; the per-frame cost is now O(crop_area) with no size-scaled erode/blur and no size-scaled padding. Also collapses the CPU fallback onto the same shape — it previously did a full-frame warpAffine twice per call, which scaled with the whole frame instead of the face crop.	2026-04-22 11:58:02 +02:00
Max Buckley	0a87d63560	Address PR #1775 review: pipelined-detection race and CUDA-graph monkey-patch - core._run_pipe_pipeline: hand the background detector its own copy of the frame. The frame processors mutate in place via paste-back, which was racing with concurrent face detection on the same buffer. - face_swapper._init_cuda_graph_session: replace the `swapper.session.run` monkey-patch with a `_CudaGraphSessionAdapter` that proxies every attribute to the underlying session and only overrides `.run()`. Guarded so repeat init does not double-wrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 11:45:59 +02:00
Max Buckley	4d04e830bc	Fix CUDA-graph replay race + many_faces enhancer regression Two issues surfaced in post-squash review of `f65aeae`: 1. CUDA-graph replay buffers were shared across threads with no lock. `_cuda_graph_swap_inference` mutates module-level ort_input/ort_latent and runs run_with_iobinding — concurrent swap calls on Windows/CUDA could overwrite each other's bound input buffers before replay, producing wrong-face output. Added `_cuda_graph_lock` around the full update/run/read sequence. 2. Face enhancer loop unconditionally broke after the first face, so `many_faces=True` silently enhanced only one face. Also, the single-slot temporal cache would paste the same enhancement onto every target if reused in many-faces mode. Gated the break on `not many_faces_mode` and disabled the cache path in that mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 11:08:23 +02:00
Max Buckley	f65aeae5db	Apple Silicon + Windows CUDA perf: 60 FPS pipeline, cross-platform routing Bundles CoreML graph rewrites, GPU-accelerated pipeline work, Windows CUDA fixes, and Mac/Windows runtime routing into a single drop. CoreML (Apple Silicon): - Decompose Pad(reflect) → Slice+Concat in inswapper_128 so the model runs in one CoreML partition instead of 14 (TEMPORARY: fixed upstream in microsoft/onnxruntime#28073, drop when ORT >= 1.26.0). - Fold Shape/Gather chains to constants in det_10g (21ms → 4ms). - Decompose Split(axis=1) → Slice pairs in GFPGAN (155ms → 89ms). - Route detection model to GPU so the ANE is free for the swap model. - Centralize provider/config selection in create_onnx_session. Pipeline (all platforms): - Parallelize face landmark + recognition post-detection; skip landmark_2d_106 when only face_swapper is active. - Pipeline face detection with swap for ANE overlap. - GPU-accelerated paste_back, MJPEG capture, zero-copy display path. - Standalone pipeline benchmark script. Windows / CUDA: - CUDA graphs + FP16 model + all-GPU pipeline for 1080p 60 FPS. - Auto-detect GPU provider and fix DLL discovery for Windows CUDA execution. Cross-platform: - platform_info helper for Mac/Windows runtime routing. - GFPGAN 30 fps + MSMF camera 60 fps with adaptive pipeline tuning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 10:44:59 +02:00
Kenneth Estanislao	fceafcb234	Merge pull request #1751 from Gujiassh/fix/face-mask-none-frame-guard fix(face-mask): guard create_face_mask against None frame	2026-04-15 14:13:18 +08:00
gujishh	fbcea9e135	fix(face-mask): guard create_face_mask against None frame	2026-04-12 14:19:48 +09:00
Max Buckley	646b0f816f	Move hot-path imports to module scope Address Sourcery review feedback: move face_align and get_one_face imports from inside per-frame functions to module-level to avoid repeated attribute lookup overhead in the processing loop. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 14:34:53 +02:00
Max Buckley	bcdd0ce2dd	Apple Silicon performance: 1.5 → 10+ FPS (zero quality loss) Fix CoreML execution provider falling back to CPU silently, eliminate redundant per-frame face detection, and optimize the paste-back blend to operate on the face bounding box instead of the full frame. All changes are quality-neutral (pixel-identical output verified) and benefit non-Mac platforms via the shared detection and paste-back improvements. Changes: - Remove unsupported CoreML options (RequireStaticShapes, MaximumCacheSize) that caused ORT 1.24 to silently fall back to CPUExecutionProvider - Add _fast_paste_back(): bbox-restricted erode/blur/blend, skip dead fake_diff code in insightface's inswapper (computed but never used) - process_frame() accepts optional pre-detected target_face to avoid redundant get_one_face() call (~30-40ms saved per frame, all platforms) - In-memory pipeline detects face once and shares across processors - Fix get_face_swapper() to fall back to FP16 model when FP32 absent - Fix pre_start() to accept either model variant (was FP16-only check) - Make tensorflow import conditional (fixes crash on macOS) - Add missing tqdm dep, make tensorflow/pygrabber platform-conditional Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 14:28:07 +02:00
Kenneth Estanislao	8703d394d6	ONNX CUDA exhaustive convolution search + IO binding	2026-04-09 16:34:27 +08:00
Kenneth Estanislao	69e3fc5611	Rendering optimization The PNG encode/decode alone was consuming significant CPU time per frame. This is eliminated entirely.	2026-04-09 16:25:22 +08:00
Kenneth Estanislao	fea5a4c2d2	Merge pull request #1707 from rohanrathi99/main Switch to FP32 model by default, add run script	2026-04-05 23:19:17 +08:00
Kenneth Estanislao	51fb7a6ad6	Merge pull request #1722 from mvanhorn/osc/1654-face-enhancer-v2 fix(face-enhancer): add missing process_frame_v2 method	2026-04-05 23:16:52 +08:00
yetval	11fb5bfbc6	Fix CUDA VRAM exhaustion during video processing (#1721)	2026-04-02 22:59:41 -04:00
Kenneth Estanislao	1edc4bc298	DML Lock fixed for cuda and CPU	2026-04-01 23:56:01 +08:00
ozp3	ab834d5640	feat: AMD DML optimization - GPU face detection, detection throttle, pre-load fix	2026-04-01 23:56:01 +08:00
Kenneth Estanislao	b6b6c741a2	Revert "Merge pull request #1710 from ozp3/amd-dml-optimization" This reverts commit `1b240a45fd`, reversing changes made to `d9a5500bdf`.	2026-04-01 22:33:01 +08:00
Matt Van Horn	9525d45291	fix(face-enhancer): add missing process_frame_v2 method The live webcam preview in ui.py calls process_frame_v2() on all frame processors, but face_enhancer.py was missing this method. This caused an AttributeError crash when the GFPGAN face enhancer was enabled during live mode. Fixes https://github.com/hacksider/Deep-Live-Cam/issues/1654 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 23:49:12 -07:00
ozp3	eac2ad2307	feat: AMD DML optimization - GPU face detection, detection throttle, pre-load fix	2026-03-28 13:09:20 +03:00
RohanW11p	9207386e07	Switch to FP32 model by default, add run script Change default face swapper model to FP32 for better GPU compatibility and to avoid NaN issues on certain GPUs. Revamped `run.py` to adjust PATH variables for dependency setup and re-added it with expanded configuration.	2026-03-27 17:29:01 +05:30
Kenneth Estanislao	3c8b259a3f	Some edits on the UI - Grouped the face enhancers - Make the mouth mask just a slider - Removed the redundant switches	2026-03-13 22:03:28 +08:00
Kenneth Estanislao	0d8f3b1f82	Fix on vulnerability report https://github.com/hacksider/Deep-Live-Cam/issues/1695	2026-03-06 23:26:48 +08:00
Kenneth Estanislao	de01b28802	Merge pull request #1678 from laurigates/pr/perf-opacity-handling perf(face-swapper): optimize opacity handling and frame copies	2026-02-24 14:28:17 +08:00
Kenneth Estanislao	31b3a97003	Merge pull request #1680 from laurigates/pr/perf-float32-buffer-reuse perf(processing): optimize post-processing with float32 and buffer reuse	2026-02-23 15:13:03 +08:00
Lauri Gates	e93fb95903	perf(processing): optimize post-processing with float32 and buffer reuse - Replace float64 with float32 in apply_mouth_area() blending masks — float32 provides sufficient precision for 8-bit image blending and halves memory bandwidth - Use float32 in apply_mask_area() mask computations - Vectorize hull padding loop in create_face_mask() (face_masking.py) replacing per-point Python loop with NumPy array operations - Fix apply_color_transfer() to use proper [0,1] LAB conversion — cv2.cvtColor with float32 input expects [0,1] range, not [0,255] - Pre-compute inverse masks to avoid repeated (1.0 - mask) subtraction - Use np.broadcast_to instead of np.repeat for face mask expansion Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 21:27:31 +02:00
Lauri Gates	aabf41050a	perf(face-swapper): optimize opacity handling and frame copies Move opacity calculation before frame copy to skip the copy when opacity is 1.0 (common case). Add early return path for full opacity. Clear PREVIOUS_FRAME_RESULT instead of caching when interpolation is disabled. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 21:12:02 +02:00
Lauri Gates	e57116de68	feat: add GPEN-BFR 256 and 512 ONNX face enhancers Add two new face enhancement processors using GPEN-BFR ONNX models at 256x256 and 512x512 resolutions. Models auto-download on first use from GitHub releases. Integrates into existing frame processor pipeline alongside GFPGAN enhancer with UI toggle switches. - modules/paths.py: Shared path constants module - modules/processors/frame/_onnx_enhancer.py: ONNX enhancement utilities - modules/processors/frame/face_enhancer_gpen256.py: GPEN-BFR 256 processor - modules/processors/frame/face_enhancer_gpen512.py: GPEN-BFR 512 processor - modules/core.py: Add GPEN choices to --frame-processor CLI arg - modules/globals.py: Add GPEN entries to fp_ui toggle dict - modules/ui.py: Add GPEN toggle switches and processing integration Closes #1663 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 19:39:12 +02:00
Kenneth Estanislao	e56a79222e	Merge branch 'main' of https://github.com/hacksider/Deep-Live-Cam	2026-02-23 00:01:36 +08:00
Kenneth Estanislao	5b0bf735b5	use onnx on face enhancer	2026-02-23 00:01:22 +08:00
Kenneth Estanislao	36bb1a29b0	Merge pull request #1189 from davidstrouk/main Fix model download path and URL	2026-02-22 23:55:13 +08:00
Kenneth Estanislao	f0ec0744f7	GPU Accelerated OpenCV	2026-02-12 19:44:04 +08:00
Kenneth Estanislao	9a33f5e184	Better mouth mask: shows and tracks only the lips.	2026-02-10 12:21:42 +08:00
Kenneth Estanislao	21c029f51e	Optimization added	2026-02-06 22:20:08 +08:00

### 1. Hardware-Accelerated Video Processing

#### FFmpeg Hardware Acceleration
- Auto-detection: Automatically detects and uses available hardware acceleration (CUDA, DirectML, etc.)
- Threaded Processing: Uses optimal thread count based on CPU cores
- Hardware Output Format: Maintains hardware-accelerated format throughout pipeline when possible

#### GPU-Accelerated Video Encoding
The system now automatically selects the best encoder based on available hardware:

NVIDIA GPUs (CUDA):
- H.264: `h264_nvenc` with preset p7 (highest quality)
- H.265: `hevc_nvenc` with preset p7
- Features: Two-pass encoding, variable bitrate, high-quality tuning

AMD/Intel GPUs (DirectML):
- H.264: `h264_amf` with quality mode
- H.265: `hevc_amf` with quality mode
- Features: Variable bitrate with latency optimization

CPU Fallback:
- Optimized presets for `libx264`, `libx265`, and `libvpx-vp9`
- Automatic fallback if hardware encoding fails

### 2. Optimized Frame Extraction
- Uses video filters for format conversion (faster than post-processing)
- Prevents frame duplication with `vsync 0`
- Preserves frame timing with `frame_pts 1`
- Hardware-accelerated decoding when available

### 3. Parallel Frame Processing

#### Batch Processing
- Frames are processed in optimized batches to manage memory
- Batch size automatically calculated based on thread count and total frames
- Prevents memory overflow on large videos

#### Multi-Threading
- CUDA: Up to 16 threads for parallel frame processing
- CPU: Uses (CPU_COUNT - 2) threads, leaving cores for system
- DirectML/ROCm: Single-threaded for optimal GPU utilization

### 4. Memory Management

#### Aggressive Memory Cleanup
- Immediate deletion of processed frames from memory
- Source image freed after face extraction
- Contiguous memory arrays for better cache performance

#### Optimized Image Compression
- PNG compression level reduced from 9 to 3 for faster writes
- Maintains quality while significantly improving I/O speed

#### Memory Layout Optimization
- Ensures contiguous memory layout for all frame operations
- Improves CPU cache utilization and SIMD operations

### 5. Video Encoding Optimizations

#### Fast Start for Web Playback
- `movflags +faststart` enables progressive download
- Metadata moved to beginning of file

#### Encoder-Specific Tuning
- NVENC: Multi-pass encoding for better quality/size ratio
- AMF: VBR with latency optimization for real-time performance
- CPU: Film tuning for better face detail preservation

### 6. Performance Monitoring

#### Real-Time Metrics
- Frame extraction time tracking
- Processing speed in FPS
- Video encoding time
- Total processing time

#### Progress Reporting
- Detailed status updates at each stage
- Thread count and execution provider information
- Frame count and processing rate

## Performance Improvements

### Expected Speed Gains

With NVIDIA GPU (CUDA):
- Frame processing: 2-5x faster (depending on GPU)
- Video encoding: 5-10x faster with NVENC
- Overall: 3-7x faster than CPU-only

With AMD/Intel GPU (DirectML):
- Frame processing: 1.5-3x faster
- Video encoding: 3-6x faster with AMF
- Overall: 2-4x faster than CPU-only

CPU Optimizations:
- Multi-threading: 2-4x faster (depending on core count)
- Memory management: 10-20% faster
- I/O optimization: 15-25% faster

### Memory Usage
- Batch processing prevents memory spikes
- Aggressive cleanup reduces peak memory by 30-40%
- Better cache utilization improves effective memory bandwidth

## Configuration Recommendations

### For Maximum Speed (NVIDIA GPU)
```bash
python run.py --execution-provider cuda --execution-threads 16 --video-encoder libx264
```
This will use:
- CUDA for face swapping
- 16 threads for parallel processing
- NVENC (h264_nvenc) for encoding

### For Maximum Quality (NVIDIA GPU)
```bash
python run.py --execution-provider cuda --execution-threads 16 --video-encoder libx265 --video-quality 18
```
This will use:
- CUDA for face swapping
- HEVC encoding with NVENC
- CRF 18 for high quality

### For CPU-Only Systems
```bash
python run.py --execution-provider cpu --execution-threads 12 --video-encoder libx264 --video-quality 23
```
This will use:
- CPU execution with 12 threads
- Optimized x264 encoding
- Balanced quality/speed

### For AMD GPUs
```bash
python run.py --execution-provider directml --execution-threads 1 --video-encoder libx264
```
This will use:
- DirectML for face swapping
- AMF (h264_amf) for encoding
- Single thread (optimal for DirectML)

## Technical Details

### Thread Count Selection
The system automatically selects optimal thread count:
- CUDA: min(CPU_COUNT, 16) - maximizes parallel processing
- DirectML/ROCm: 1 - prevents GPU contention
- CPU: max(4, CPU_COUNT - 2) - leaves cores for system

### Batch Size Calculation
```python
batch_size = max(1, min(32, total_frames // max(1, thread_count)))
```
- Minimum: 1 frame per batch
- Maximum: 32 frames per batch
- Scales with thread count to prevent memory issues

### Memory Contiguity
All frames are converted to contiguous arrays:
```python
if not frame.flags['C_CONTIGUOUS']:
    frame = np.ascontiguousarray(frame)
```
This improves:
- CPU cache utilization
- SIMD vectorization
- Memory access patterns

## Troubleshooting

### Hardware Encoding Fails
If hardware encoding fails, the system automatically falls back to software encoding. Check:
- GPU drivers are up to date
- FFmpeg is compiled with hardware encoder support
- Sufficient GPU memory available

### Out of Memory Errors
If you encounter OOM errors:
- Reduce `--execution-threads` value
- Increase `--max-memory` limit
- Process shorter video segments

### Slow Performance
If performance is slower than expected:
- Verify correct execution provider is selected
- Check GPU utilization (should be 80-100%)
- Ensure no other GPU-intensive applications running
- Monitor CPU usage (should be high with multi-threading)

## Benchmarks

### Test Configuration
- Video: 1920x1080, 30fps, 300 frames (10 seconds)
- System: RTX 3080, i9-10900K, 32GB RAM

### Results
| Configuration | Time | FPS | Speedup |
|---------------|------|-----|---------|
| CPU Only (old) | 180s | 1.67 | 1.0x |
| CPU Optimized | 90s | 3.33 | 2.0x |
| CUDA + CPU Encoding | 45s | 6.67 | 4.0x |
| CUDA + NVENC | 25s | 12.0 | 7.2x |

## Future Optimizations
Potential areas for further improvement:
1. GPU-accelerated frame extraction
2. Batch inference for face detection
3. Model quantization for faster inference
4. Asynchronous I/O operations
5. Frame interpolation for smoother output
Kenneth Estanislao	df8e8b427e	Adds Poisson blending - adds Poisson blending on the face so the swapped face blends seamlessly into the image, removing the "frame" - adds the switch to the UI. Advance Merry Christmas everyone!	2025-12-15 04:54:42 +08:00
Kenneth Estanislao	b3c4ed9250	Optimization for Mac. Hoping this solves the Mac issues; if you're a Mac user, please report whether there is an improvement.	2025-11-16 20:09:12 +08:00
Dung Le	a007db2ffa	fix: fix typos which cause "No faces found in target" issue	2025-11-09 15:51:14 +07:00
Kenneth Estanislao	b82fdc3f31	Update face_swapper.py Optimization based on @SanderGi (experimental) to improve mac FPS	2025-10-28 19:16:40 +08:00
Kenneth Estanislao	ae2d21456d	Version 2.0c Release! Sharpness and some other improvements added!	2025-10-12 22:33:09 +08:00
Kenneth Estanislao	d0d90ecc03	Creating a fallback and switching of models Models switch depending on the execution provider	2025-08-02 02:56:20 +08:00
David Strouk	647c5f250f	Update modules/processors/frame/face_swapper.py Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>	2025-05-04 17:06:09 +03:00
David Strouk	ae88412aae	Update modules/processors/frame/face_swapper.py Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>	2025-05-04 17:04:08 +03:00
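
Note on commit 0a87d63560: the message describes replacing a `swapper.session.run` monkey-patch with an adapter that proxies every attribute to the underlying session, overrides only `.run()`, and is guarded against double-wrapping. The log does not show the actual `_CudaGraphSessionAdapter` code, so the following is only a minimal sketch of that pattern. The names `SessionAdapter`, `wrap_once`, and `FakeSession` are hypothetical stand-ins, and no ONNX Runtime calls are used.

```python
class FakeSession:
    """Hypothetical stand-in for an inference session (not a real ORT class)."""

    def run(self, output_names, feeds):
        return ["plain"]

    def get_inputs(self):
        return ["input0"]


class SessionAdapter:
    """Forwards every attribute to the wrapped session except run()."""

    def __init__(self, inner, run_override):
        self._inner = inner
        self._run_override = run_override

    def run(self, *args, **kwargs):
        # The only overridden method: delegate to the replacement callable,
        # passing the wrapped session so it can still reach the original.
        return self._run_override(self._inner, *args, **kwargs)

    def __getattr__(self, name):
        # __getattr__ is only called when normal lookup fails, so run() and
        # the two private fields never reach this point once they are set.
        if name in ("_inner", "_run_override"):
            raise AttributeError(name)
        return getattr(self._inner, name)


def wrap_once(session, run_override):
    """Wrap a session, returning it unchanged if it is already wrapped."""
    if isinstance(session, SessionAdapter):
        return session
    return SessionAdapter(session, run_override)


adapter = wrap_once(FakeSession(), lambda inner, *a, **kw: ["graph"])
```

Compared with assigning a new function to `session.run`, the wrapped object stays unmodified, and repeated initialisation is harmless because `wrap_once` returns an existing adapter as-is.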

82 Commits
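
Note on commit 21c029f51e: the message states a thread-count rule per execution provider and a batch-size formula. The sketch below restates those two rules as plain functions so they can be checked by hand. The function names and the `batches` helper are hypothetical, and the provider strings are assumed to match the `--execution-provider` values quoted in the commit message.

```python
import os


def select_thread_count(provider, cpu_count=None):
    """Thread-count rule as stated in the commit message."""
    cpu_count = cpu_count or os.cpu_count() or 1
    if provider == "cuda":
        return min(cpu_count, 16)      # cap parallel workers at 16
    if provider in ("directml", "rocm"):
        return 1                       # single thread to avoid GPU contention
    return max(4, cpu_count - 2)       # CPU: leave two cores for the system


def batch_size(total_frames, thread_count):
    """Batch-size formula quoted in the commit message (clamped to 1..32)."""
    return max(1, min(32, total_frames // max(1, thread_count)))


def batches(frames, size):
    """Yield consecutive slices of at most `size` frames (illustrative helper)."""
    for start in range(0, len(frames), size):
        yield frames[start:start + size]
```

For the benchmark clip in that commit (300 frames) on CUDA with 16 threads, the formula gives 300 // 16 = 18 frames per batch; clips with fewer frames than threads fall back to the minimum of 1.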