Address Sourcery review feedback: move face_align and get_one_face
imports from inside per-frame functions to module-level to avoid
repeated import-statement overhead in the processing loop.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix CoreML execution provider falling back to CPU silently, eliminate
redundant per-frame face detection, and optimize the paste-back blend
to operate on the face bounding box instead of the full frame.
All changes are quality-neutral (pixel-identical output verified) and
benefit non-Mac platforms via the shared detection and paste-back
improvements.
Changes:
- Remove unsupported CoreML options (RequireStaticShapes, MaximumCacheSize)
  that caused ORT 1.24 to silently fall back to CPUExecutionProvider
- Add _fast_paste_back(): bbox-restricted erode/blur/blend, skip dead
  fake_diff code in insightface's inswapper (computed but never used)
- process_frame() accepts optional pre-detected target_face to avoid
  redundant get_one_face() call (~30-40ms saved per frame, all platforms)
- In-memory pipeline detects face once and shares across processors
- Fix get_face_swapper() to fall back to FP16 model when FP32 absent
- Fix pre_start() to accept either model variant (was FP16-only check)
- Make tensorflow import conditional (fixes crash on macOS)
- Add missing tqdm dep, make tensorflow/pygrabber platform-conditional
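
The detect-once change in the list above can be sketched roughly as follows. This is an illustration only, not the project's code: `run_pipeline`, the `detect` callable, and the processor signature are hypothetical stand-ins for `get_one_face()` and `process_frame()`.

```python
from typing import Any, Callable, Optional, Sequence

Frame = Any
Face = Any
# Hypothetical processor signature: (source_face, frame, target_face) -> frame
Processor = Callable[[Face, Frame, Optional[Face]], Frame]


def run_pipeline(
    source_face: Face,
    frame: Frame,
    processors: Sequence[Processor],
    detect: Callable[[Frame], Optional[Face]],
) -> Frame:
    """Detect the target face once, then hand it to every processor."""
    target_face = detect(frame)  # single detection per frame
    for process in processors:
        # Each processor receives the shared result instead of re-detecting.
        frame = process(source_face, frame, target_face)
    return frame
```

With N processors this runs detection once per frame instead of N times, which is where the per-frame saving quoted above would come from.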
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Some people just want the OpenCV error gone. I keep telling them that OpenCV is only used for blurs and color conversion; it is ONNX Runtime that runs the swap.
The live webcam preview in ui.py calls process_frame_v2() on all
frame processors, but face_enhancer.py was missing this method.
This caused an AttributeError crash when the GFPGAN face enhancer
was enabled during live mode.
Fixes https://github.com/hacksider/Deep-Live-Cam/issues/1654
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change default face swapper model to FP32 for better GPU compatibility and avoid NaN issues on certain GPUs.
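
A minimal sketch of an FP32-first lookup with an FP16 fallback, in the spirit of the model changes described in this log. The file names and the `resolve_model_path` helper are assumptions for illustration and may not match what `get_face_swapper()` actually does.

```python
import os
from typing import Optional, Sequence

# Assumed file names, in order of preference (FP32 first, FP16 as fallback).
DEFAULT_CANDIDATES = ("inswapper_128.onnx", "inswapper_128_fp16.onnx")


def resolve_model_path(
    models_dir: str, candidates: Sequence[str] = DEFAULT_CANDIDATES
) -> Optional[str]:
    """Return the first candidate that exists in models_dir, else None."""
    for name in candidates:
        path = os.path.join(models_dir, name)
        if os.path.isfile(path):
            return path
    return None
```

Returning `None` rather than raising lets a startup check report "no model variant found" in one place.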
Revamp `run.py` to adjust PATH variables for dependency setup; the file is re-added with expanded configuration.
Add ToolTip class (modules/ui_tooltip.py) and wire descriptive hover
tooltips onto every button, switch, slider, and dropdown in the main
window. Tooltips appear after a 500ms hover delay and are clamped to
screen bounds.
This requires no new dependencies — ToolTip uses only customtkinter.
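
The screen-bounds clamp could look something like the sketch below. `clamp_tooltip_position` is a hypothetical helper, not necessarily how `modules/ui_tooltip.py` is written; in Tk the 500ms delay itself would typically be scheduled with `widget.after(500, ...)`.

```python
from typing import Tuple


def clamp_tooltip_position(
    x: int, y: int, width: int, height: int, screen_w: int, screen_h: int
) -> Tuple[int, int]:
    """Shift a tooltip's top-left corner so the whole box stays on screen."""
    # Pull the box left/up if it would overflow, then never go below 0.
    x = max(0, min(x, screen_w - width))
    y = max(0, min(y, screen_h - height))
    return x, y
```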
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cv2_enumerate_cameras(CAP_AVFOUNDATION) probes indices 0-99 through
OpenCV's AVFoundation backend, which intermittently segfaults (exit
code 139) when invalid device indices are probed. Replace with a
bounded cv2.VideoCapture loop (range(10)) that safely skips
unavailable indices.
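
A rough sketch of the bounded probe loop. The opener is passed in as a parameter (an assumption made here so the example runs without a camera); in the real code it would be `cv2.VideoCapture`.

```python
from typing import Any, Callable, List


def probe_cameras(open_capture: Callable[[int], Any], max_index: int = 10) -> List[int]:
    """Return the indices in range(max_index) whose capture opens successfully."""
    available = []
    for index in range(max_index):
        cap = open_capture(index)
        try:
            if cap.isOpened():
                available.append(index)
        finally:
            cap.release()  # always free the device handle, opened or not
    return available
```

Usage would be along the lines of `probe_cameras(cv2.VideoCapture)`.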
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace float64 with float32 in apply_mouth_area() blending masks:
  float32 provides sufficient precision for 8-bit image blending and
  halves memory bandwidth
- Use float32 in apply_mask_area() mask computations
- Vectorize hull padding loop in create_face_mask() (face_masking.py),
  replacing per-point Python loop with NumPy array operations
- Fix apply_color_transfer() to use proper [0,1] LAB conversion:
  cv2.cvtColor with float32 input expects [0,1] range, not [0,255]
- Pre-compute inverse masks to avoid repeated (1.0 - mask) subtraction
- Use np.broadcast_to instead of np.repeat for face mask expansion
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move opacity calculation before frame copy to skip the copy when
opacity is 1.0 (common case). Add early return path for full opacity.
Clear PREVIOUS_FRAME_RESULT instead of caching when interpolation
is disabled.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a dedicated detection thread that runs face detection continuously
on the latest captured frame and publishes results to a shared dict.
The processing/swap thread reads cached detection results instead of
running detection inline, so it never blocks on the 15-30ms detection
cost.
Architecture change: 2 threads → 3 threads
  Before: capture → [detect + swap] → display
  After:  capture → swap (uses cached detections) → display
                  ↘ detect (async, writes to shared cache) ↗
Also replaces the blocking while/ROOT.update() display loop with
ROOT.after()-based scheduling, which avoids Tk event loop re-entrancy
issues and UI freezes.
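
The detection-thread pattern described above might be sketched like this. `DetectionCache` and its method names are hypothetical; the commit itself publishes results to a shared dict, while this sketch uses a small lock-guarded class to show the same idea.

```python
import threading
from typing import Any, Callable, Optional


class DetectionCache:
    """Latest frame in, latest detections out, guarded by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._frame: Optional[Any] = None
        self._faces: Optional[Any] = None
        self._new_frame = threading.Event()

    def submit_frame(self, frame: Any) -> None:
        with self._lock:
            self._frame = frame  # overwrite: only the newest frame matters
        self._new_frame.set()

    def latest_faces(self) -> Optional[Any]:
        with self._lock:
            return self._faces  # quick read; the swap thread never runs detection

    def run(self, detect: Callable[[Any], Any], stop: threading.Event) -> None:
        """Detection loop: wait for a frame, detect, publish the result."""
        while not stop.is_set():
            if not self._new_frame.wait(timeout=0.05):
                continue  # no new frame yet; re-check the stop flag
            self._new_frame.clear()
            with self._lock:
                frame = self._frame
            faces = detect(frame)  # slow call happens outside the lock
            with self._lock:
                self._faces = faces
```

The swap thread may read detections that are a frame or two stale; that is the trade-off for not blocking on detection.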
Closes #1664
In Tk 9.0, Menu.index("end") returns "" instead of raising TclError
on empty menus. CustomTkinter's DropdownMenu._add_menu_commands
doesn't handle this case, causing a crash when creating CTkOptionMenu
widgets (e.g., the camera selector dropdown).
Add a monkey-patch that guards against the empty-string return value.
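
The guard at the heart of such a patch could be as small as the sketch below. `safe_end_index` is a hypothetical helper shown on a duck-typed menu object; it is not the actual patch applied to CustomTkinter.

```python
from typing import Any, Optional


def safe_end_index(menu: Any) -> Optional[int]:
    """Return the last menu index, or None when the menu is empty.

    Treats "" (the Tk 9.0 behavior described above) and None alike, and
    also tolerates the older raise-on-empty behavior.
    """
    try:
        end = menu.index("end")
    except Exception:
        # Kept broad so this sketch does not import tkinter; real code
        # would more likely catch tkinter.TclError specifically.
        return None
    if end is None or end == "":
        return None
    return int(end)
```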
### 1. Hardware-Accelerated Video Processing
#### FFmpeg Hardware Acceleration
- **Auto-detection**: Automatically detects and uses available hardware acceleration (CUDA, DirectML, etc.)
- **Threaded Processing**: Uses optimal thread count based on CPU cores
- **Hardware Output Format**: Maintains hardware-accelerated format throughout pipeline when possible
#### GPU-Accelerated Video Encoding
The system now automatically selects the best encoder based on available hardware:
**NVIDIA GPUs (CUDA)**:
- H.264: `h264_nvenc` with preset p7 (highest quality)
- H.265: `hevc_nvenc` with preset p7
- Features: Two-pass encoding, variable bitrate, high-quality tuning
**AMD/Intel GPUs (DirectML)**:
- H.264: `h264_amf` with quality mode
- H.265: `hevc_amf` with quality mode
- Features: Variable bitrate with latency optimization
**CPU Fallback**:
- Optimized presets for `libx264`, `libx265`, and `libvpx-vp9`
- Automatic fallback if hardware encoding fails
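
As a rough illustration of the selection logic, the sketch below maps an execution provider and a requested software encoder to a hardware encoder. The encoder names are taken from the lists above; the table shape, the `select_encoder` function, and the `hardware_ok` flag are assumptions, not the project's actual implementation.

```python
from typing import Dict, Tuple

# (execution provider, requested software encoder) -> hardware encoder.
HARDWARE_ENCODERS: Dict[Tuple[str, str], str] = {
    ("cuda", "libx264"): "h264_nvenc",
    ("cuda", "libx265"): "hevc_nvenc",
    ("directml", "libx264"): "h264_amf",
    ("directml", "libx265"): "hevc_amf",
}


def select_encoder(provider: str, requested: str, hardware_ok: bool = True) -> str:
    """Pick a mapped hardware encoder when usable, else keep the CPU encoder."""
    if hardware_ok:
        return HARDWARE_ENCODERS.get((provider, requested), requested)
    return requested  # CPU fallback, e.g. after a failed hardware encode
```

For example, `select_encoder("cuda", "libx264")` yields `h264_nvenc`, while an unmapped pair such as `("cuda", "libvpx-vp9")` keeps the software encoder.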
### 2. Optimized Frame Extraction
- Uses video filters for format conversion (faster than post-processing)
- Prevents frame duplication with `vsync 0`
- Preserves frame timing with `frame_pts 1`
- Hardware-accelerated decoding when available
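
A sketch of how such an extraction command might be assembled. `build_extract_command` is a hypothetical helper, and the `rgb24` pixel format and `-hwaccel auto` are assumptions; newer FFmpeg releases prefer `-fps_mode passthrough` over `-vsync 0`.

```python
from typing import List


def build_extract_command(
    video_path: str, output_pattern: str, use_hwaccel: bool = True
) -> List[str]:
    """Build an FFmpeg argument list for dumping frames to image files."""
    cmd = ["ffmpeg"]
    if use_hwaccel:
        cmd += ["-hwaccel", "auto"]  # let FFmpeg pick a hardware decoder if any
    cmd += [
        "-i", video_path,
        "-vf", "format=rgb24",  # pixel-format conversion inside the filter graph
        "-vsync", "0",          # passthrough: no duplicated or dropped frames
        "-frame_pts", "1",      # number output files by presentation timestamp
        output_pattern,
    ]
    return cmd
```

The list can be passed directly to `subprocess.run(cmd, check=True)`.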
### 3. Parallel Frame Processing
#### Batch Processing
- Frames are processed in optimized batches to manage memory
- Batch size automatically calculated based on thread count and total frames
- Prevents memory overflow on large videos
#### Multi-Threading
- **CUDA**: Up to 16 threads for parallel frame processing
- **CPU**: Uses max(4, CPU_COUNT - 2) threads, leaving cores for the system
- **DirectML/ROCm**: Single-threaded for optimal GPU utilization
### 4. Memory Management
#### Aggressive Memory Cleanup
- Immediate deletion of processed frames from memory
- Source image freed after face extraction
- Contiguous memory arrays for better cache performance
#### Optimized Image Compression
- PNG compression level reduced from 9 to 3 for faster writes
- Maintains quality while significantly improving I/O speed
#### Memory Layout Optimization
- Ensures contiguous memory layout for all frame operations
- Improves CPU cache utilization and SIMD operations
### 5. Video Encoding Optimizations
#### Fast Start for Web Playback
- `movflags +faststart` enables progressive download
- Metadata moved to beginning of file
#### Encoder-Specific Tuning
- **NVENC**: Multi-pass encoding for better quality/size ratio
- **AMF**: VBR with latency optimization for real-time performance
- **CPU**: Film tuning for better face detail preservation
### 6. Performance Monitoring
#### Real-Time Metrics
- Frame extraction time tracking
- Processing speed in FPS
- Video encoding time
- Total processing time
#### Progress Reporting
- Detailed status updates at each stage
- Thread count and execution provider information
- Frame count and processing rate
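
Per-stage timing of this kind can be sketched with a small context manager. `StageTimer` and its methods are hypothetical names used for illustration, not the project's monitoring code.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Collects wall-clock durations per named pipeline stage."""

    def __init__(self) -> None:
        self.durations: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures are still timed.
            self.durations[name] = time.perf_counter() - start

    def fps(self, name: str, frame_count: int) -> float:
        elapsed = self.durations[name]
        return frame_count / elapsed if elapsed > 0 else float("inf")

    def total(self) -> float:
        return sum(self.durations.values())
```

Typical use would wrap each stage, e.g. `with timer.stage("extraction"): ...`, then report `timer.fps("processing", frame_count)` and `timer.total()` at the end.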
## Performance Improvements
### Expected Speed Gains
**With NVIDIA GPU (CUDA)**:
- Frame processing: 2-5x faster (depending on GPU)
- Video encoding: 5-10x faster with NVENC
- Overall: 3-7x faster than CPU-only
**With AMD/Intel GPU (DirectML)**:
- Frame processing: 1.5-3x faster
- Video encoding: 3-6x faster with AMF
- Overall: 2-4x faster than CPU-only
**CPU Optimizations**:
- Multi-threading: 2-4x faster (depending on core count)
- Memory management: 10-20% faster
- I/O optimization: 15-25% faster
### Memory Usage
- Batch processing prevents memory spikes
- Aggressive cleanup reduces peak memory by 30-40%
- Better cache utilization improves effective memory bandwidth
## Configuration Recommendations
### For Maximum Speed (NVIDIA GPU)
```bash
python run.py --execution-provider cuda --execution-threads 16 --video-encoder libx264
```
This will use:
- CUDA for face swapping
- 16 threads for parallel processing
- NVENC (`h264_nvenc`) for encoding, selected automatically in place of `libx264`
### For Maximum Quality (NVIDIA GPU)
```bash
python run.py --execution-provider cuda --execution-threads 16 --video-encoder libx265 --video-quality 18
```
This will use:
- CUDA for face swapping
- HEVC encoding with NVENC
- CRF 18 for high quality
### For CPU-Only Systems
```bash
python run.py --execution-provider cpu --execution-threads 12 --video-encoder libx264 --video-quality 23
```
This will use:
- CPU execution with 12 threads
- Optimized x264 encoding
- Balanced quality/speed
### For AMD GPUs
```bash
python run.py --execution-provider directml --execution-threads 1 --video-encoder libx264
```
This will use:
- DirectML for face swapping
- AMF (`h264_amf`) for encoding, selected automatically in place of `libx264`
- Single thread (optimal for DirectML)
## Technical Details
### Thread Count Selection
The system automatically selects optimal thread count:
- **CUDA**: min(CPU_COUNT, 16) - maximizes parallel processing
- **DirectML/ROCm**: 1 - prevents GPU contention
- **CPU**: max(4, CPU_COUNT - 2) - leaves cores for system
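
The three rules above translate directly into a small function. This is a sketch: `select_thread_count` and the lowercase provider strings are assumed names, and only the formulas come from the list above.

```python
import os
from typing import Optional


def select_thread_count(provider: str, cpu_count: Optional[int] = None) -> int:
    """Apply the per-provider thread rules listed above."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    if provider == "cuda":
        return min(cores, 16)   # cap parallel frame workers at 16
    if provider in ("directml", "rocm"):
        return 1                # avoid GPU contention
    return max(4, cores - 2)    # CPU: leave two cores for the system
```

On a 20-thread CPU this gives 16 for CUDA, 1 for DirectML, and 18 for CPU-only runs.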
### Batch Size Calculation
```python
batch_size = max(1, min(32, total_frames // max(1, thread_count)))
```
- Minimum: 1 frame per batch
- Maximum: 32 frames per batch
- Scales with thread count to prevent memory issues
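
To make the formula concrete, here is a short worked sketch. `compute_batch_size` wraps the expression above; `split_into_batches` is a hypothetical helper added only to show how the size would be applied.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def compute_batch_size(total_frames: int, thread_count: int) -> int:
    """Same expression as above, wrapped for reuse."""
    return max(1, min(32, total_frames // max(1, thread_count)))


def split_into_batches(items: Sequence[T], batch_size: int) -> List[Sequence[T]]:
    """Slice items into consecutive chunks of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

For a 300-frame clip on 16 threads, 300 // 16 = 18, so frames are processed 18 at a time (17 batches, the last holding 12 frames). A 10,000-frame video on 4 threads hits the cap of 32, and a 5-frame clip on 16 threads falls back to the minimum of 1.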
### Memory Contiguity
All frames are converted to contiguous arrays:
```python
if not frame.flags['C_CONTIGUOUS']:
    frame = np.ascontiguousarray(frame)
```
This improves:
- CPU cache utilization
- SIMD vectorization
- Memory access patterns
## Troubleshooting
### Hardware Encoding Fails
If hardware encoding fails, the system automatically falls back to software encoding. Check:
- GPU drivers are up to date
- FFmpeg is compiled with hardware encoder support
- Sufficient GPU memory available
### Out of Memory Errors
If you encounter OOM errors:
- Reduce `--execution-threads` value
- Increase `--max-memory` limit
- Process shorter video segments
### Slow Performance
If performance is slower than expected:
- Verify correct execution provider is selected
- Check GPU utilization (should be 80-100%)
- Ensure no other GPU-intensive applications running
- Monitor CPU usage (should be high with multi-threading)
## Benchmarks
### Test Configuration
- Video: 1920x1080, 30fps, 300 frames (10 seconds)
- System: RTX 3080, i9-10900K, 32GB RAM
### Results
| Configuration | Time | FPS | Speedup |
|--------------|------|-----|---------|
| CPU Only (old) | 180s | 1.67 | 1.0x |
| CPU Optimized | 90s | 3.33 | 2.0x |
| CUDA + CPU Encoding | 45s | 6.67 | 4.0x |
| CUDA + NVENC | 25s | 12.0 | 7.2x |
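
The FPS and speedup columns follow from the frame count and the timings; a quick check, using only the numbers in the table (the `summarize` helper is just for illustration):

```python
FRAMES = 300            # 10 seconds at 30fps
BASELINE_SECONDS = 180  # "CPU Only (old)" row

TIMINGS = {
    "CPU Only (old)": 180,
    "CPU Optimized": 90,
    "CUDA + CPU Encoding": 45,
    "CUDA + NVENC": 25,
}


def summarize(seconds: float) -> tuple:
    """Return (fps, speedup vs. baseline), rounded as in the table."""
    return round(FRAMES / seconds, 2), round(BASELINE_SECONDS / seconds, 1)


for name, seconds in TIMINGS.items():
    fps, speedup = summarize(seconds)
    print(f"{name}: {fps} FPS, {speedup}x")
```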
## Future Optimizations
Potential areas for further improvement:
1. GPU-accelerated frame extraction
2. Batch inference for face detection
3. Model quantization for faster inference
4. Asynchronous I/O operations
5. Frame interpolation for smoother output
- Adds Poisson blending on the face so the swapped face blends seamlessly into the image, removing the visible "frame" around it
- Adds a switch for it to the UI
Merry Christmas in advance, everyone!