mirror of https://github.com/hacksider/Deep-Live-Cam.git, synced 2026-05-31 01:39:29 +02:00
cbf0859347
Both face_swapper._fast_paste_back and face_enhancer._paste_back were doing a numpy float32 round-trip per frame: convert the target crop and the warped face to float32, blend, clip, cast back to uint8. That's four crop-sized allocations plus unvectorized elementwise math.

Replace with a fused uint8 blend using cv2.merge + cv2.multiply + cv2.add, which cv2 dispatches to SIMD (NEON on Apple Silicon / AVX on x86). Stored alpha templates switched from float32 [0, 1] to uint8 [0, 255] so no conversion is needed per frame.

CUDA paths also simplified: upload uint8 alpha (less bandwidth) and scale on device.

Micro-bench on 1000x1000 RGB crop:

  current (float32 numpy):  9.43 ms
  cv2 uint8 fused:          1.16 ms  (8.1× faster, max diff 2/255)

Visual diff is imperceptible (quantization noise in the last step).