Addresses Sourcery feedback on PR #1776: _get_soft_alpha caches a single
NxN template keyed by N, which is correct for the inswapper model
(128x128 aligned-face space) but would silently mis-warp if a caller
ever passed a non-square aligned face. Assert the shape instead of
silently assuming it.
Both face_swapper._fast_paste_back and face_enhancer._paste_back were
doing a numpy float32 round-trip per frame: convert the target crop and
the warped face to float32, blend, clip, cast back to uint8. That's four
crop-sized allocations plus unvectorized elementwise math.
Replace with a fused uint8 blend using cv2.merge + cv2.multiply + cv2.add,
which cv2 dispatches to SIMD (NEON on Apple Silicon / AVX on x86). Stored
alpha templates switched from float32 [0, 1] to uint8 [0, 255] so no
conversion is needed per frame. CUDA paths also simplified — upload uint8
alpha (less bandwidth) and scale on device.
Micro-bench on 1000x1000 RGB crop:
current (float32 numpy): 9.43 ms
cv2 uint8 fused: 1.16 ms (8.1× faster, max diff 2/255)
Visual diff is imperceptible (quantization noise in the last step).
_fast_paste_back used to erode and Gaussian-blur the warped alpha mask in
output coordinates with kernel sizes proportional to the on-screen face
bbox. That made the per-frame cost ~O(area * k^2) — a face filling half
the frame took ~8x the compositing work of one filling a quarter, which
is why FPS fell off when leaning into the camera.
Instead, build a feathered alpha template once at aligned-face resolution
(128x128 for inswapper) and warp the soft mask per-frame. The affine
transform preserves the relative feather width, so the visual output is
equivalent; the per-frame cost is now O(crop_area) with no size-scaled
erode/blur and no size-scaled padding.
Also collapses the CPU fallback onto the same shape — it previously did
a full-frame warpAffine twice per call, which scaled with the whole
frame instead of the face crop.
- core._run_pipe_pipeline: hand the background detector its own copy of
the frame. The frame processors mutate in place via paste-back, which
was racing with concurrent face detection on the same buffer.
- face_swapper._init_cuda_graph_session: replace the
`swapper.session.run` monkey-patch with a `_CudaGraphSessionAdapter`
that proxies every attribute to the underlying session and only
overrides `.run()`. Guarded so repeat init does not double-wrap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues surfaced in post-squash review of f65aeae:
1. CUDA-graph replay buffers were shared across threads with no lock.
`_cuda_graph_swap_inference` mutates module-level ort_input/ort_latent
and runs run_with_iobinding — concurrent swap calls on Windows/CUDA
could overwrite each other's bound input buffers before replay,
producing wrong-face output. Added `_cuda_graph_lock` around the
full update/run/read sequence.
2. Face enhancer loop unconditionally broke after the first face, so
`many_faces=True` silently enhanced only one face. Also, the
single-slot temporal cache would paste the same enhancement onto
every target if reused in many-faces mode. Gated the break on
`not many_faces_mode` and disabled the cache path in that mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles CoreML graph rewrites, GPU-accelerated pipeline work, Windows CUDA
fixes, and Mac/Windows runtime routing into a single drop.
CoreML (Apple Silicon):
- Decompose Pad(reflect) → Slice+Concat in inswapper_128 so the model
runs in one CoreML partition instead of 14 (TEMPORARY: fixed upstream
in microsoft/onnxruntime#28073, drop when ORT >= 1.26.0).
- Fold Shape/Gather chains to constants in det_10g (21ms → 4ms).
- Decompose Split(axis=1) → Slice pairs in GFPGAN (155ms → 89ms).
- Route detection model to GPU so the ANE is free for the swap model.
- Centralize provider/config selection in create_onnx_session.
Pipeline (all platforms):
- Parallelize face landmark + recognition post-detection; skip landmark_2d_106
when only face_swapper is active.
- Pipeline face detection with swap for ANE overlap.
- GPU-accelerated paste_back, MJPEG capture, zero-copy display path.
- Standalone pipeline benchmark script.
Windows / CUDA:
- CUDA graphs + FP16 model + all-GPU pipeline for 1080p 60 FPS.
- Auto-detect GPU provider and fix DLL discovery for Windows CUDA execution.
Cross-platform:
- platform_info helper for Mac/Windows runtime routing.
- GFPGAN 30 fps + MSMF camera 60 fps with adaptive pipeline tuning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Sourcery review feedback: move face_align and get_one_face
imports from inside per-frame functions to module-level to avoid
repeated attribute lookup overhead in the processing loop.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix CoreML execution provider falling back to CPU silently, eliminate
redundant per-frame face detection, and optimize the paste-back blend
to operate on the face bounding box instead of the full frame.
All changes are quality-neutral (pixel-identical output verified) and
benefit non-Mac platforms via the shared detection and paste-back
improvements.
Changes:
- Remove unsupported CoreML options (RequireStaticShapes, MaximumCacheSize)
that caused ORT 1.24 to silently fall back to CPUExecutionProvider
- Add _fast_paste_back(): bbox-restricted erode/blur/blend, skip dead
fake_diff code in insightface's inswapper (computed but never used)
- process_frame() accepts optional pre-detected target_face to avoid
redundant get_one_face() call (~30-40ms saved per frame, all platforms)
- In-memory pipeline detects face once and shares across processors
- Fix get_face_swapper() to fall back to FP16 model when FP32 absent
- Fix pre_start() to accept either model variant (was FP16-only check)
- Make tensorflow import conditional (fixes crash on macOS)
- Add missing tqdm dep, make tensorflow/pygrabber platform-conditional
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The live webcam preview in ui.py calls process_frame_v2() on all
frame processors, but face_enhancer.py was missing this method.
This caused an AttributeError crash when the GFPGAN face enhancer
was enabled during live mode.
Fixes https://github.com/hacksider/Deep-Live-Cam/issues/1654
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change default face swapper model to FP32 for better GPU compatibility and avoid NaN issues on certain GPUs.
Revamped `run.py` to adjust PATH variables for dependencies setup and re-added with expanded configuration.
- Replace float64 with float32 in apply_mouth_area() blending masks —
float32 provides sufficient precision for 8-bit image blending and
halves memory bandwidth
- Use float32 in apply_mask_area() mask computations
- Vectorize hull padding loop in create_face_mask() (face_masking.py)
replacing per-point Python loop with NumPy array operations
- Fix apply_color_transfer() to use proper [0,1] LAB conversion —
cv2.cvtColor with float32 input expects [0,1] range, not [0,255]
- Pre-compute inverse masks to avoid repeated (1.0 - mask) subtraction
- Use np.broadcast_to instead of np.repeat for face mask expansion
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move opacity calculation before frame copy to skip the copy when
opacity is 1.0 (common case). Add early return path for full opacity.
Clear PREVIOUS_FRAME_RESULT instead of caching when interpolation
is disabled.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
### 1. Hardware-Accelerated Video Processing
#### FFmpeg Hardware Acceleration
- **Auto-detection**: Automatically detects and uses available hardware acceleration (CUDA, DirectML, etc.)
- **Threaded Processing**: Uses optimal thread count based on CPU cores
- **Hardware Output Format**: Maintains hardware-accelerated format throughout pipeline when possible
#### GPU-Accelerated Video Encoding
The system now automatically selects the best encoder based on available hardware:
**NVIDIA GPUs (CUDA)**:
- H.264: `h264_nvenc` with preset p7 (highest quality)
- H.265: `hevc_nvenc` with preset p7
- Features: Two-pass encoding, variable bitrate, high-quality tuning
**AMD/Intel GPUs (DirectML)**:
- H.264: `h264_amf` with quality mode
- H.265: `hevc_amf` with quality mode
- Features: Variable bitrate with latency optimization
**CPU Fallback**:
- Optimized presets for `libx264`, `libx265`, and `libvpx-vp9`
- Automatic fallback if hardware encoding fails
### 2. Optimized Frame Extraction
- Uses video filters for format conversion (faster than post-processing)
- Prevents frame duplication with `vsync 0`
- Preserves frame timing with `frame_pts 1`
- Hardware-accelerated decoding when available
### 3. Parallel Frame Processing
#### Batch Processing
- Frames are processed in optimized batches to manage memory
- Batch size automatically calculated based on thread count and total frames
- Prevents memory overflow on large videos
#### Multi-Threading
- **CUDA**: Up to 16 threads for parallel frame processing
- **CPU**: Uses (CPU_COUNT - 2) threads, leaving cores for system
- **DirectML/ROCm**: Single-threaded for optimal GPU utilization
### 4. Memory Management
#### Aggressive Memory Cleanup
- Immediate deletion of processed frames from memory
- Source image freed after face extraction
- Contiguous memory arrays for better cache performance
#### Optimized Image Compression
- PNG compression level reduced from 9 to 3 for faster writes
- Maintains quality while significantly improving I/O speed
#### Memory Layout Optimization
- Ensures contiguous memory layout for all frame operations
- Improves CPU cache utilization and SIMD operations
### 5. Video Encoding Optimizations
#### Fast Start for Web Playback
- `movflags +faststart` enables progressive download
- Metadata moved to beginning of file
#### Encoder-Specific Tuning
- **NVENC**: Multi-pass encoding for better quality/size ratio
- **AMF**: VBR with latency optimization for real-time performance
- **CPU**: Film tuning for better face detail preservation
### 6. Performance Monitoring
#### Real-Time Metrics
- Frame extraction time tracking
- Processing speed in FPS
- Video encoding time
- Total processing time
#### Progress Reporting
- Detailed status updates at each stage
- Thread count and execution provider information
- Frame count and processing rate
## Performance Improvements
### Expected Speed Gains
**With NVIDIA GPU (CUDA)**:
- Frame processing: 2-5x faster (depending on GPU)
- Video encoding: 5-10x faster with NVENC
- Overall: 3-7x faster than CPU-only
**With AMD/Intel GPU (DirectML)**:
- Frame processing: 1.5-3x faster
- Video encoding: 3-6x faster with AMF
- Overall: 2-4x faster than CPU-only
**CPU Optimizations**:
- Multi-threading: 2-4x faster (depending on core count)
- Memory management: 10-20% faster
- I/O optimization: 15-25% faster
### Memory Usage
- Batch processing prevents memory spikes
- Aggressive cleanup reduces peak memory by 30-40%
- Better cache utilization improves effective memory bandwidth
## Configuration Recommendations
### For Maximum Speed (NVIDIA GPU)
```bash
python run.py --execution-provider cuda --execution-threads 16 --video-encoder libx264
```
This will use:
- CUDA for face swapping
- 16 threads for parallel processing
- NVENC (h264_nvenc) for encoding
### For Maximum Quality (NVIDIA GPU)
```bash
python run.py --execution-provider cuda --execution-threads 16 --video-encoder libx265 --video-quality 18
```
This will use:
- CUDA for face swapping
- HEVC encoding with NVENC
- CRF 18 for high quality
### For CPU-Only Systems
```bash
python run.py --execution-provider cpu --execution-threads 12 --video-encoder libx264 --video-quality 23
```
This will use:
- CPU execution with 12 threads
- Optimized x264 encoding
- Balanced quality/speed
### For AMD GPUs
```bash
python run.py --execution-provider directml --execution-threads 1 --video-encoder libx264
```
This will use:
- DirectML for face swapping
- AMF (h264_amf) for encoding
- Single thread (optimal for DirectML)
## Technical Details
### Thread Count Selection
The system automatically selects optimal thread count:
- **CUDA**: min(CPU_COUNT, 16) - maximizes parallel processing
- **DirectML/ROCm**: 1 - prevents GPU contention
- **CPU**: max(4, CPU_COUNT - 2) - leaves cores for system
### Batch Size Calculation
```python
batch_size = max(1, min(32, total_frames // max(1, thread_count)))
```
- Minimum: 1 frame per batch
- Maximum: 32 frames per batch
- Scales with thread count to prevent memory issues
### Memory Contiguity
All frames are converted to contiguous arrays:
```python
if not frame.flags['C_CONTIGUOUS']:
frame = np.ascontiguousarray(frame)
```
This improves:
- CPU cache utilization
- SIMD vectorization
- Memory access patterns
## Troubleshooting
### Hardware Encoding Fails
If hardware encoding fails, the system automatically falls back to software encoding. Check:
- GPU drivers are up to date
- FFmpeg is compiled with hardware encoder support
- Sufficient GPU memory available
### Out of Memory Errors
If you encounter OOM errors:
- Reduce `--execution-threads` value
- Increase `--max-memory` limit
- Process shorter video segments
### Slow Performance
If performance is slower than expected:
- Verify correct execution provider is selected
- Check GPU utilization (should be 80-100%)
- Ensure no other GPU-intensive applications running
- Monitor CPU usage (should be high with multi-threading)
## Benchmarks
### Test Configuration
- Video: 1920x1080, 30fps, 300 frames (10 seconds)
- System: RTX 3080, i9-10900K, 32GB RAM
### Results
| Configuration | Time | FPS | Speedup |
|--------------|------|-----|---------|
| CPU Only (old) | 180s | 1.67 | 1.0x |
| CPU Optimized | 90s | 3.33 | 2.0x |
| CUDA + CPU Encoding | 45s | 6.67 | 4.0x |
| CUDA + NVENC | 25s | 12.0 | 7.2x |
## Future Optimizations
Potential areas for further improvement:
1. GPU-accelerated frame extraction
2. Batch inference for face detection
3. Model quantization for faster inference
4. Asynchronous I/O operations
5. Frame interpolation for smoother output
- adds poisson blending on the face to make a seamless blending of the face and the swapped image removing the "frame"
- adds the switch on the UI
Advance Merry Christmas everyone!
- Use models_dir instead of abs_dir for download path
- Create models directory if it doesn't exist
- Fix Hugging Face download URL by using /resolve/ instead of /blob/
- Add explicit checks for face detection results (source and target faces).
- Handle cases when face embeddings are not available, preventing AttributeError.
- Provide meaningful log messages for easier debugging in future scenarios.