Document multi-GPU parallelism, benchmarks, and remote SSH execution

Add a comprehensive section covering:
- How model sharding (pipeline parallelism) works and its limitations
- GPU selection via --gpus flag
- Pipeline parallel benchmarks on GPT-OSS-120B across 3-8 A100-80GB GPUs
- Stage-by-stage timing breakdown
- When data parallelism helps (and when it doesn't)
- Remote SSH execution with CLI and YAML examples
- Decision table for choosing the right setup
This commit is contained in:
Stella Biderman
2026-03-16 14:39:22 -04:00
parent 51f621d0a2
commit c723da02c8
+132
View File
@@ -415,6 +415,138 @@ Includes pre-liberated variants (Dolphin, Hermes, WhiteRabbitNeo) for A/B compar
obliteratus models
```
## Multi-GPU and remote execution
OBLITERATUS automatically shards models across multiple GPUs when they don't fit on a single card. It also supports remote execution over SSH, so you can run the pipeline on a GPU server from your laptop.
### How model sharding works
When you have multiple GPUs, OBLITERATUS uses accelerate's `device_map="auto"` to split the model's layers across all available GPUs. This is **naive pipeline parallelism** — layers are distributed evenly, but only one GPU computes at a time as activations flow sequentially through the layer stack. The other GPUs hold their assigned layers in memory but are idle until their turn.
This means multi-GPU sharding is a **memory solution, not a speed solution**. It lets you run models that don't fit on one GPU, but it won't make small models run faster. In fact, more GPUs can be *slower* due to inter-GPU data transfer overhead at layer boundaries.
### Selecting GPUs
Use `--gpus` to control which GPUs are used:
```bash
# Use all 8 GPUs (default)
obliteratus obliterate bigmodel/200B --gpus all
# Use only GPUs 0-3
obliteratus obliterate bigmodel/200B --gpus 0,1,2,3
# Use a specific pair
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct --gpus 2,5
```
This sets `CUDA_VISIBLE_DEVICES` before CUDA initializes. The model is then sharded across the selected GPUs.
### Pipeline parallel benchmarks
We benchmarked the full abliteration pipeline on `openai/gpt-oss-120b` (117B MoE, ~234 GB in bf16) across varying numbers of A100-80GB GPUs:
| GPUs | Total time | VRAM/GPU | Notes |
|------|-----------|----------|-------|
| 3 | **FAILED** | ~78 GB | Not enough headroom for activations; some layers offloaded to CPU as meta tensors, crashes during EXCISE |
| 4 | **615s** (10m15s) | ~58 GB | Fastest. Fewest inter-GPU transfers. Snapshot auto-skipped (insufficient free VRAM) |
| 5 | 763s (12m43s) | ~47 GB | +24% slower than 4 GPUs |
| 6 | 766s (12m46s) | ~39 GB | +25% slower than 4 GPUs |
| 8 | 633s (10m33s) | ~29 GB | +3% slower than 4 GPUs. Ran CPU-side state dict snapshot (adds ~20s) |
Stage breakdown (approximately constant across GPU counts):
| Stage | Time | Bottleneck |
|-------|------|-----------|
| SUMMON (load) | ~11s | Disk I/O (model cached locally) |
| PROBE (activations) | ~20s | Forward passes through sharded model |
| DISTILL + EXCISE | ~30s | SVD + weight projection (CPU-bound) |
| VERIFY | ~210s | Forward passes on validation prompts |
| REBIRTH (save) | ~350s | Writing 234 GB to disk |
Key findings:
- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B.
- **The pipeline is I/O-dominated for large models.** VERIFY and REBIRTH together account for ~90% of wall time. The actual compute (PROBE, DISTILL, EXCISE) is fast regardless of GPU count.
- **Leave headroom.** The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model.
- **Pipeline parallelism doesn't help compute-bound stages.** Since only one GPU computes at a time, doubling GPUs doesn't halve PROBE or VERIFY time. It only enables fitting larger models.
### When you actually need data parallelism
For models that fit on a single GPU with room to spare, the PROBE stage (which runs 1024 forward passes to collect activations) is the main computational bottleneck. Pipeline parallelism doesn't help here — it still processes one prompt at a time through the full layer stack.
True data parallelism (replicating the model and splitting prompts across GPUs) can speed up PROBE, but it requires enough VRAM to hold a full copy of the model on each GPU. An experimental pre-replicated data parallel implementation is available on the `data-parallel-prereplication` branch:
```bash
git checkout data-parallel-prereplication
obliteratus obliterate EleutherAI/pythia-12b --data-parallel
```
This deep-copies the model to each GPU once, then distributes prompt batches across replicas using a thread pool. Benchmarks on Pythia 12B (24 GB model, 8x A100-80GB):
| Mode | PROBE time | Notes |
|------|-----------|-------|
| Single GPU | 7.1s | Baseline |
| Pre-replicated DP (8 GPUs) | 7.7s | Near parity — PROBE is too fast at this scale for parallelism to help |
Data parallelism becomes more valuable as the prompt count or model size increases relative to the per-forward-pass cost. For most models, the overhead of replication exceeds the time saved.
### Remote execution over SSH
Run the full pipeline on a remote GPU node from your local machine. OBLITERATUS handles SSH connection, auto-installs itself on the remote if needed, streams logs in real time, and copies results back when done.
```bash
# Basic remote run
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
--remote user@gpu-node
# With SSH key and custom options
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
--remote root@10.0.0.5 \
--ssh-key ~/.ssh/id_rsa \
--ssh-port 2222 \
--remote-dir /data/obliteratus \
--remote-python python3.11
# Don't copy results back (keep on remote only)
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
--remote user@gpu-node --no-sync
```
Remote execution also works with `obliteratus run` (YAML configs) and `obliteratus tourney` (method comparison). You can specify remote settings in YAML:
```yaml
model:
name: meta-llama/Llama-3.1-70B-Instruct
dtype: float16
remote:
host: gpu-node
user: root
ssh_key: ~/.ssh/id_rsa
remote_dir: /tmp/obliteratus_run
gpus: "0,1,2,3" # select GPUs on the remote
sync_results: true # copy results back when done
```
The remote runner:
1. Tests SSH connectivity
2. Detects GPUs on the remote (`nvidia-smi`)
3. Installs obliteratus if not already present
4. Uploads config files if using `obliteratus run`
5. Runs the pipeline with real-time log streaming
6. Copies results back via SCP
### Choosing the right setup
| Scenario | Recommendation |
|----------|---------------|
| Model fits on 1 GPU | Use 1 GPU. Adding more won't help and may slow things down. |
| Model fits on 1 GPU, PROBE is slow (many prompts) | Try `data-parallel-prereplication` branch. Only helps if model fits on each GPU with room for activations. |
| Model doesn't fit on 1 GPU | Use `--gpus` with the **minimum** number of GPUs that fits. E.g., a 70B model in fp16 (~140 GB) needs 2x A100-80GB — don't use 4. |
| Model needs 4+ GPUs | Pipeline parallel via `device_map="auto"` is the only option. Expect I/O-dominated runtimes for very large models. |
| No local GPUs | Use `--remote user@gpu-node` to run on a remote machine, or use HuggingFace Spaces / Colab. |
## 10 study presets
Pre-configured ablation studies you can run out of the box: