23 Commits

Author SHA1 Message Date
Max Buckley f65aeae5db Apple Silicon + Windows CUDA perf: 60 FPS pipeline, cross-platform routing
Bundles CoreML graph rewrites, GPU-accelerated pipeline work, Windows CUDA
fixes, and Mac/Windows runtime routing into a single drop.

CoreML (Apple Silicon):
- Decompose Pad(reflect) → Slice+Concat in inswapper_128 so the model
  runs in one CoreML partition instead of 14 (TEMPORARY: fixed upstream
  in microsoft/onnxruntime#28073, drop when ORT >= 1.26.0).
- Fold Shape/Gather chains to constants in det_10g (21ms → 4ms).
- Decompose Split(axis=1) → Slice pairs in GFPGAN (155ms → 89ms).
- Route detection model to GPU so the ANE is free for the swap model.
- Centralize provider/config selection in create_onnx_session.
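The Pad(reflect) → Slice+Concat rewrite above can be illustrated in NumPy (a minimal 1-D sketch of the equivalence; the actual rewrite operates on the ONNX graph nodes, not on arrays):

```python
import numpy as np

def reflect_pad_1d(x, pad):
    # Reflect padding on the last axis, rebuilt from Slice + Concat:
    # mirror the interior samples on each side, excluding the edge itself.
    left = x[..., 1:pad + 1][..., ::-1]
    right = x[..., -pad - 1:-1][..., ::-1]
    return np.concatenate([left, x, right], axis=-1)

x = np.array([1., 2., 3., 4.])
assert np.array_equal(reflect_pad_1d(x, 2), np.pad(x, 2, mode="reflect"))
```

Expressed this way, every node is a plain Slice or Concat, which CoreML supports in a single partition.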

Pipeline (all platforms):
- Parallelize face landmark + recognition post-detection; skip landmark_2d_106
  when only face_swapper is active.
- Pipeline face detection with swap for ANE overlap.
- GPU-accelerated paste_back, MJPEG capture, zero-copy display path.
- Standalone pipeline benchmark script.
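The post-detection parallelism can be sketched with a thread pool (hypothetical helper names; the stand-in functions below replace the real recognition and landmark_2d_106 models):

```python
from concurrent.futures import ThreadPoolExecutor

def run_recognition(frame, bbox):
    # Stand-in for the recognition model.
    return "embedding"

def run_landmarks(frame, bbox):
    # Stand-in for landmark_2d_106.
    return "landmarks"

def analyze_face(frame, bbox, need_landmarks):
    # After detection, recognition and landmarks are independent,
    # so they can run concurrently; landmark_2d_106 is skipped
    # entirely when only face_swapper needs the result.
    with ThreadPoolExecutor(max_workers=2) as pool:
        rec = pool.submit(run_recognition, frame, bbox)
        lmk = pool.submit(run_landmarks, frame, bbox) if need_landmarks else None
        return rec.result(), lmk.result() if lmk else None
```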

Windows / CUDA:
- CUDA graphs + FP16 model + all-GPU pipeline for 1080p 60 FPS.
- Auto-detect GPU provider and fix DLL discovery for Windows CUDA execution.

Cross-platform:
- platform_info helper for Mac/Windows runtime routing.
- GFPGAN 30 fps + MSMF camera 60 fps with adaptive pipeline tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 10:44:59 +02:00
Max Buckley bcdd0ce2dd Apple Silicon performance: 1.5 → 10+ FPS (zero quality loss)
Fix CoreML execution provider falling back to CPU silently, eliminate
redundant per-frame face detection, and optimize the paste-back blend
to operate on the face bounding box instead of the full frame.

All changes are quality-neutral (pixel-identical output verified) and
benefit non-Mac platforms via the shared detection and paste-back
improvements.

Changes:
- Remove unsupported CoreML options (RequireStaticShapes, MaximumCacheSize)
  that caused ORT 1.24 to silently fall back to CPUExecutionProvider
- Add _fast_paste_back(): bbox-restricted erode/blur/blend, skip dead
  fake_diff code in insightface's inswapper (computed but never used)
- process_frame() accepts optional pre-detected target_face to avoid
  redundant get_one_face() call (~30-40ms saved per frame, all platforms)
- In-memory pipeline detects face once and shares across processors
- Fix get_face_swapper() to fall back to FP16 model when FP32 absent
- Fix pre_start() to accept either model variant (was FP16-only check)
- Make tensorflow import conditional (fixes crash on macOS)
- Add missing tqdm dep, make tensorflow/pygrabber platform-conditional
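The optional pre-detected face can be sketched as follows (hypothetical signatures; `get_one_face` here is a stub standing in for the real detector call that costs ~30-40ms per frame):

```python
def get_one_face(frame):
    # Stand-in for the real detector.
    return {"bbox": (0, 0, 128, 128)}

def process_frame(source_face, frame, target_face=None):
    # Reuse a caller-supplied detection instead of re-detecting;
    # the in-memory pipeline detects once and passes the result
    # to every processor.
    if target_face is None:
        target_face = get_one_face(frame)
    return target_face
```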

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 14:28:07 +02:00
Kenneth Estanislao 69e3fc5611 Rendering optimization
The PNG encode/decode alone was consuming significant CPU time per frame. This is eliminated entirely.
2026-04-09 16:25:22 +08:00
yetval 11fb5bfbc6 Fix CUDA VRAM exhaustion during video processing (#1721) 2026-04-02 22:59:41 -04:00
ozp3 ab834d5640 feat: AMD DML optimization - GPU face detection, detection throttle, pre-load fix 2026-04-01 23:56:01 +08:00
Kenneth Estanislao bb4ef4a133 Apply suggestion from @sourcery-ai[bot]
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
2026-04-01 23:13:59 +08:00
Karl a3fd56a312 Fix missing video output reporting and encoding flow 2026-04-01 15:22:09 +08:00
Lauri Gates e57116de68 feat: add GPEN-BFR 256 and 512 ONNX face enhancers
Add two new face enhancement processors using GPEN-BFR ONNX models
at 256x256 and 512x512 resolutions. Models auto-download on first
use from GitHub releases. Integrates into existing frame processor
pipeline alongside GFPGAN enhancer with UI toggle switches.

- modules/paths.py: Shared path constants module
- modules/processors/frame/_onnx_enhancer.py: ONNX enhancement utilities
- modules/processors/frame/face_enhancer_gpen256.py: GPEN-BFR 256 processor
- modules/processors/frame/face_enhancer_gpen512.py: GPEN-BFR 512 processor
- modules/core.py: Add GPEN choices to --frame-processor CLI arg
- modules/globals.py: Add GPEN entries to fp_ui toggle dict
- modules/ui.py: Add GPEN toggle switches and processing integration

Closes #1663

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 19:39:12 +02:00
Kenneth Estanislao 5b0bf735b5 Use ONNX for the face enhancer 2026-02-23 00:01:22 +08:00
Kenneth Estanislao 21c029f51e Optimization added
### 1. Hardware-Accelerated Video Processing

#### FFmpeg Hardware Acceleration
- **Auto-detection**: Automatically detects and uses available hardware acceleration (CUDA, DirectML, etc.)
- **Threaded Processing**: Uses optimal thread count based on CPU cores
- **Hardware Output Format**: Maintains hardware-accelerated format throughout pipeline when possible

#### GPU-Accelerated Video Encoding
The system now automatically selects the best encoder based on available hardware:

**NVIDIA GPUs (CUDA)**:
- H.264: `h264_nvenc` with preset p7 (highest quality)
- H.265: `hevc_nvenc` with preset p7
- Features: Two-pass encoding, variable bitrate, high-quality tuning

**AMD GPUs (AMF)**:
- H.264: `h264_amf` with quality mode
- H.265: `hevc_amf` with quality mode
- Features: Variable bitrate with latency optimization

**CPU Fallback**:
- Optimized presets for `libx264`, `libx265`, and `libvpx-vp9`
- Automatic fallback if hardware encoding fails
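The encoder auto-selection described above might look roughly like this (a hypothetical helper; the real selection also probes ffmpeg for actual encoder support before committing):

```python
def pick_encoder(codec, providers):
    # Map ONNX Runtime execution providers to hardware encoders,
    # falling back to software encoding on the CPU.
    if "CUDAExecutionProvider" in providers:
        return {"h264": "h264_nvenc", "h265": "hevc_nvenc"}[codec]
    if "DmlExecutionProvider" in providers:
        return {"h264": "h264_amf", "h265": "hevc_amf"}[codec]
    return {"h264": "libx264", "h265": "libx265"}[codec]
```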

### 2. Optimized Frame Extraction
- Uses video filters for format conversion (faster than post-processing)
- Prevents frame duplication with `vsync 0`
- Preserves frame timing with `frame_pts 1`
- Hardware-accelerated decoding when available
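Combined, the extraction flags correspond to an ffmpeg invocation along these lines (a sketch; paths and the pixel format are placeholders):

```python
def extraction_cmd(video, out_pattern):
    # -vsync 0 prevents frame duplication, -frame_pts 1 preserves
    # frame timing, and format conversion runs in the filter graph
    # instead of as a post-processing pass.
    return ["ffmpeg", "-hwaccel", "auto", "-i", video,
            "-vf", "format=rgb24", "-vsync", "0", "-frame_pts", "1",
            out_pattern]
```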

### 3. Parallel Frame Processing

#### Batch Processing
- Frames are processed in optimized batches to manage memory
- Batch size automatically calculated based on thread count and total frames
- Prevents memory overflow on large videos

#### Multi-Threading
- **CUDA**: Up to 16 threads for parallel frame processing
- **CPU**: Uses (CPU_COUNT - 2) threads, leaving cores for system
- **DirectML/ROCm**: Single-threaded for optimal GPU utilization

### 4. Memory Management

#### Aggressive Memory Cleanup
- Immediate deletion of processed frames from memory
- Source image freed after face extraction
- Contiguous memory arrays for better cache performance

#### Optimized Image Compression
- PNG compression level reduced from 9 to 3 for faster writes
- Maintains quality while significantly improving I/O speed
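The costly stage of PNG encoding is DEFLATE (zlib), so the level 9 → 3 tradeoff can be seen directly with the standard library (a sketch; the real change is in the frame-writing call):

```python
import zlib

# Stand-in for raw frame bytes.
data = b"\x00\x7f\xff" * 200_000
fast = zlib.compress(data, 3)   # much faster to encode
best = zlib.compress(data, 9)   # slightly smaller output
assert zlib.decompress(fast) == data  # both levels are lossless
```

Since PNG is lossless at every compression level, lowering the level only affects write speed and file size, never quality.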

#### Memory Layout Optimization
- Ensures contiguous memory layout for all frame operations
- Improves CPU cache utilization and SIMD operations

### 5. Video Encoding Optimizations

#### Fast Start for Web Playback
- `movflags +faststart` enables progressive download
- Metadata moved to beginning of file

#### Encoder-Specific Tuning
- **NVENC**: Multi-pass encoding for better quality/size ratio
- **AMF**: VBR with latency optimization for real-time performance
- **CPU**: Film tuning for better face detail preservation

### 6. Performance Monitoring

#### Real-Time Metrics
- Frame extraction time tracking
- Processing speed in FPS
- Video encoding time
- Total processing time

#### Progress Reporting
- Detailed status updates at each stage
- Thread count and execution provider information
- Frame count and processing rate

## Performance Improvements

### Expected Speed Gains

**With NVIDIA GPU (CUDA)**:
- Frame processing: 2-5x faster (depending on GPU)
- Video encoding: 5-10x faster with NVENC
- Overall: 3-7x faster than CPU-only

**With AMD/Intel GPU (DirectML)**:
- Frame processing: 1.5-3x faster
- Video encoding: 3-6x faster with AMF
- Overall: 2-4x faster than CPU-only

**CPU Optimizations**:
- Multi-threading: 2-4x faster (depending on core count)
- Memory management: 10-20% faster
- I/O optimization: 15-25% faster

### Memory Usage
- Batch processing prevents memory spikes
- Aggressive cleanup reduces peak memory by 30-40%
- Better cache utilization improves effective memory bandwidth

## Configuration Recommendations

### For Maximum Speed (NVIDIA GPU)
```bash
python run.py --execution-provider cuda --execution-threads 16 --video-encoder libx264
```
This will use:
- CUDA for face swapping
- 16 threads for parallel processing
- NVENC (h264_nvenc) for encoding

### For Maximum Quality (NVIDIA GPU)
```bash
python run.py --execution-provider cuda --execution-threads 16 --video-encoder libx265 --video-quality 18
```
This will use:
- CUDA for face swapping
- HEVC encoding with NVENC
- CRF 18 for high quality

### For CPU-Only Systems
```bash
python run.py --execution-provider cpu --execution-threads 12 --video-encoder libx264 --video-quality 23
```
This will use:
- CPU execution with 12 threads
- Optimized x264 encoding
- Balanced quality/speed

### For AMD GPUs
```bash
python run.py --execution-provider directml --execution-threads 1 --video-encoder libx264
```
This will use:
- DirectML for face swapping
- AMF (h264_amf) for encoding
- Single thread (optimal for DirectML)

## Technical Details

### Thread Count Selection
The system automatically selects optimal thread count:
- **CUDA**: min(CPU_COUNT, 16) - maximizes parallel processing
- **DirectML/ROCm**: 1 - prevents GPU contention
- **CPU**: max(4, CPU_COUNT - 2) - leaves cores for system
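The selection rule above, as a sketch (`os.cpu_count()` stands in for the detected core count):

```python
import os

def pick_thread_count(provider):
    cores = os.cpu_count() or 4
    if provider == "cuda":
        return min(cores, 16)       # maximize parallel frame processing
    if provider in ("directml", "rocm"):
        return 1                    # avoid GPU contention
    return max(4, cores - 2)        # CPU: leave headroom for the system
```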

### Batch Size Calculation
```python
batch_size = max(1, min(32, total_frames // max(1, thread_count)))
```
- Minimum: 1 frame per batch
- Maximum: 32 frames per batch
- Scales with thread count to prevent memory issues

### Memory Contiguity
All frames are converted to contiguous arrays:
```python
import numpy as np

if not frame.flags['C_CONTIGUOUS']:
    frame = np.ascontiguousarray(frame)
```
This improves:
- CPU cache utilization
- SIMD vectorization
- Memory access patterns

## Troubleshooting

### Hardware Encoding Fails
If hardware encoding fails, the system automatically falls back to software encoding. Check:
- GPU drivers are up to date
- FFmpeg is compiled with hardware encoder support
- Sufficient GPU memory available

### Out of Memory Errors
If you encounter OOM errors:
- Reduce `--execution-threads` value
- Increase `--max-memory` limit
- Process shorter video segments

### Slow Performance
If performance is slower than expected:
- Verify correct execution provider is selected
- Check GPU utilization (should be 80-100%)
- Ensure no other GPU-intensive applications are running
- Monitor CPU usage (should be high with multi-threading)

## Benchmarks

### Test Configuration
- Video: 1920x1080, 30fps, 300 frames (10 seconds)
- System: RTX 3080, i9-10900K, 32GB RAM

### Results
| Configuration | Time | FPS | Speedup |
|--------------|------|-----|---------|
| CPU Only (old) | 180s | 1.67 | 1.0x |
| CPU Optimized | 90s | 3.33 | 2.0x |
| CUDA + CPU Encoding | 45s | 6.67 | 4.0x |
| CUDA + NVENC | 25s | 12.0 | 7.2x |

## Future Optimizations

Potential areas for further improvement:
1. GPU-accelerated frame extraction
2. Batch inference for face detection
3. Model quantization for faster inference
4. Asynchronous I/O operations
5. Frame interpolation for smoother output
2026-02-06 22:20:08 +08:00
qitian 5132f86cdc Add multi-language support 2025-01-07 14:04:18 +08:00
Mehdi Mousavi 81da9a23ca Fix mouth mask description 2024-12-24 09:51:32 +03:30
Mehdi Mousavi 007867a6f6 Add support for --mouth-mask argument 2024-12-24 09:40:06 +03:30
pereiraroland26 f122006024 updated README.md and created variables for pop dimensions 2024-09-10 14:28:33 +05:30
pereiraroland26@gmail.com 53fc65ca7c Added ability to map faces 2024-09-10 05:40:55 +05:30
Vic P 4f62119c2e Support the following options:
- Mirror the live camera display, matching what you see in a front-facing camera view (like iPhone's Mirror Front Camera).
- Make the live camera frame resizable.
Note: These options are turned off by default. Enabling both options may reduce performance by ~2%.

Signed-off-by: Vic P <vic4key@gmail.com>
2024-08-22 00:35:05 +07:00
Vic P 7313a332c8 Re-enabled the NSFW function (turned off by default).
Refer to PR #237 in the `experimental` branch.

Signed-off-by: Vic P <vic4key@gmail.com>
2024-08-21 02:02:00 +07:00
Kenneth Estanislao 69d863b44a Revert "Merge remote-tracking branch 'parent/experimental' into experimental"
This reverts commit df99f6ca17.
2024-08-16 21:03:14 +08:00
Aleksandr Spiridonov df99f6ca17 Merge remote-tracking branch 'parent/experimental' into experimental 2024-08-16 20:28:57 +08:00
Kenneth Estanislao 22abb8c25f Revert "Merge pull request #293 from vic4key/experimental"
This reverts commit eab5ba7027.
2024-08-16 13:47:12 +08:00
Kenneth Estanislao eab5ba7027 Merge pull request #293 from vic4key/experimental
To fix bugs and support more options for the Live function (see details in Commits tab)
2024-08-15 14:21:58 +08:00
Gianfranco Palumbo 99214c7ab1 Stop live mode when the preview window is closed 2024-08-14 17:52:42 +02:00
Kenneth Estanislao e616245e3d initial commit
rebranding everything
2023-09-24 21:36:57 +08:00