Stella Biderman 51f621d0a2 Save model snapshot to CPU to avoid OOM on multi-GPU setups
The snapshot() deepcopy was cloning tensors on their original GPU
devices, doubling VRAM usage. For a 234GB model sharded across 6
A100-80GB GPUs (~39GB each), this left no room for the copy.
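
The memory arithmetic behind that claim can be sketched as follows. It assumes the weights are sharded evenly and treats 80 GB as fully usable; in practice usable memory is lower, and activations and allocator overhead need space too, so the real headroom is smaller than shown. The variable names are illustrative only.

```python
# Figures taken from the commit message; even sharding is an assumption.
MODEL_GB = 234
NUM_GPUS = 6
GPU_CAPACITY_GB = 80  # nominal A100-80GB capacity

per_gpu_weights = MODEL_GB / NUM_GPUS       # weights resident on each GPU
with_gpu_snapshot = 2 * per_gpu_weights     # weights plus an on-device copy

# Nominal headroom left for activations and everything else.
headroom_gpu_snapshot = GPU_CAPACITY_GB - with_gpu_snapshot
headroom_cpu_snapshot = GPU_CAPACITY_GB - per_gpu_weights

print(f"per-GPU weights:          {per_gpu_weights:.0f} GB")
print(f"headroom, GPU snapshot:   {headroom_gpu_snapshot:.0f} GB")
print(f"headroom, CPU snapshot:   {headroom_cpu_snapshot:.0f} GB")
```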

snapshot() now stores tensors on CPU, and restore() copies them back
to each parameter's current device.
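
A minimal sketch of the pattern, not the code from this commit: the function bodies below are an assumption about how a CPU-side snapshot and a device-aware restore could look, built on `state_dict()` and in-place `copy_()`. It runs on CPU-only machines; on a sharded model each tensor would be copied back to whichever device its parameter currently lives on.

```python
import torch
from torch import nn


def snapshot(model: nn.Module) -> dict:
    """Copy every parameter and buffer to CPU (hypothetical helper)."""
    return {
        name: tensor.detach().to("cpu", copy=True)
        for name, tensor in model.state_dict().items()
    }


def restore(model: nn.Module, saved: dict) -> None:
    """Copy the saved values back in place.

    copy_() writes into the existing tensor, so the data lands on the
    device that tensor currently occupies, wherever that is.
    """
    with torch.no_grad():
        for name, tensor in model.state_dict().items():
            tensor.copy_(saved[name])


# Small demonstration on a toy model.
model = nn.Linear(4, 2)
saved = snapshot(model)
original_weight = saved["weight"].clone()

with torch.no_grad():
    model.weight.zero_()  # simulate a change that must be undone

restore(model, saved)
```

Because the snapshot holds independent CPU copies, later changes to the model do not affect it, and no extra VRAM is held while it is kept around.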
2026-03-13 17:13:50 -04:00