New `obliteratus gpu-calc` subcommand estimates the minimum GPU count from
model params, dtype, and GPU VRAM. Auto-detects param counts from HF
configs, including MoE expert structure.
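The estimate reduces to simple arithmetic. A minimal sketch, assuming a fixed headroom fraction for activations and KV cache (the `overhead` value and the function name are illustrative, not the tool's actual logic):

```python
import math

def min_gpu_count(num_params: float, bytes_per_param: float,
                  vram_gb: float, overhead: float = 0.1) -> int:
    """Estimate the minimum GPUs needed to hold the model weights."""
    model_gb = num_params * bytes_per_param / 1e9
    usable_gb = vram_gb * (1 - overhead)  # reserve headroom per GPU
    return math.ceil(model_gb / usable_gb)

# 120B params in bf16 (2 bytes) on A100-80GB cards:
# 240 GB of weights / 72 GB usable per GPU -> 4 GPUs
```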
README now covers --dtype, --quantization flags, the gpu-calc command,
and references both in the "Choosing the right setup" table.
Adds a comprehensive README section covering:
- How model sharding (pipeline parallelism) works and its limitations
- GPU selection via --gpus flag
- Pipeline parallel benchmarks on GPT-OSS-120B across 3-8 A100-80GB GPUs
- Stage-by-stage timing breakdown
- When data parallelism helps (and when it doesn't)
- Remote SSH execution with CLI and YAML examples
- Decision table for choosing the right setup
The snapshot() deepcopy was cloning tensors on their original GPU
devices, doubling VRAM usage. For a 234GB model sharded across 6
A100-80GB GPUs (~39GB each), this left no room for the copy.
Now snapshot stores tensors on CPU and restore() moves them back
to each parameter's current device.
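The fix can be sketched as follows (a minimal illustration operating on a state_dict rather than a deepcopy; function names are illustrative):

```python
import torch

def snapshot(model: torch.nn.Module) -> dict:
    # Clone every tensor onto the CPU so the copy costs host RAM, not VRAM
    return {name: t.detach().to("cpu", copy=True)
            for name, t in model.state_dict().items()}

def restore(model: torch.nn.Module, snap: dict) -> None:
    # Copy each saved tensor back to wherever the parameter lives *now*,
    # which may differ from the device it was on at snapshot time
    with torch.no_grad():
        for name, param in model.named_parameters():
            param.copy_(snap[name].to(param.device))
```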
When --data-parallel is passed and the model fits on a single GPU, it is
wrapped with nn.DataParallel to split prompt batches across all available
GPUs during activation collection. Batch size scales with GPU count. Hooks
already move activations to CPU, so they work correctly across replicas.
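A minimal sketch of the wrapping and batch-size scaling (function and variable names are illustrative assumptions):

```python
import torch

def maybe_data_parallel(model: torch.nn.Module, data_parallel: bool):
    # nn.DataParallel replicates the model on every visible GPU and splits
    # each input batch across the replicas; only worthwhile with >1 GPU
    if data_parallel and torch.cuda.device_count() > 1:
        return torch.nn.DataParallel(model), torch.cuda.device_count()
    return model, 1

# Scale the prompt batch so each replica still sees the original batch size:
# model, n_replicas = maybe_data_parallel(model, args.data_parallel)
# batch_size = base_batch_size * n_replicas
```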
Adds --gpus flag to obliterate, run, and tourney commands for controlling
which GPUs to use (sets CUDA_VISIBLE_DEVICES). Works both locally and with
--remote. Models are automatically split across selected GPUs via
accelerate's device_map="auto". Also adds gpus field to remote YAML config.
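The GPU selection itself is a one-liner; a sketch, assuming it runs before CUDA is initialized (function name and model id are illustrative):

```python
import os

def select_gpus(gpus: str) -> None:
    """Restrict this process to the given GPUs, e.g. select_gpus("0,2,3").

    Must run before torch/accelerate initialize CUDA, or the
    restriction has no effect on already-created contexts.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = gpus

# Loading afterwards shards across only the visible GPUs:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("some/model", device_map="auto")
```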
Adds --remote [user@]host flag to obliterate, run, and tourney commands,
enabling execution on remote GPU nodes via SSH. Also supports a remote:
section in YAML configs. The remote runner handles SSH connectivity checks,
GPU detection, auto-installation of obliteratus, log streaming, and result
syncing back to the local machine via scp.
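The connectivity check and result syncing can be sketched with plain `ssh`/`scp` subprocess calls (a simplified illustration; the real runner also handles GPU detection, auto-install, and log streaming):

```python
import subprocess

def check_ssh(host: str, timeout: int = 10) -> bool:
    # BatchMode makes ssh fail fast instead of prompting for a password
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout}",
         host, "true"],
        capture_output=True,
    )
    return result.returncode == 0

def sync_results(host: str, remote_dir: str, local_dir: str) -> None:
    # Pull finished run artifacts back to the local machine
    subprocess.run(["scp", "-r", f"{host}:{remote_dir}", local_dir], check=True)
```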