From c723da02c84194f14078675d2b777996076ff2df Mon Sep 17 00:00:00 2001
From: Stella Biderman
Date: Mon, 16 Mar 2026 14:39:22 -0400
Subject: [PATCH] Document multi-GPU parallelism, benchmarks, and remote SSH execution

Add a comprehensive section covering:
- How model sharding (pipeline parallelism) works and its limitations
- GPU selection via --gpus flag
- Pipeline parallel benchmarks on GPT-OSS-120B across 3-8 A100-80GB GPUs
- Stage-by-stage timing breakdown
- When data parallelism helps (and when it doesn't)
- Remote SSH execution with CLI and YAML examples
- Decision table for choosing the right setup
---
 README.md | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)

diff --git a/README.md b/README.md
index 045ae5f..008cde1 100644
--- a/README.md
+++ b/README.md
@@ -415,6 +415,138 @@ Includes pre-liberated variants (Dolphin, Hermes, WhiteRabbitNeo) for A/B compar
 obliteratus models
 ```
 
+## Multi-GPU and remote execution
+
+OBLITERATUS automatically shards models across multiple GPUs when they don't fit on a single card. It also supports remote execution over SSH, so you can run the pipeline on a GPU server from your laptop.
+
+### How model sharding works
+
+When you have multiple GPUs, OBLITERATUS uses accelerate's `device_map="auto"` to split the model's layers across all available GPUs. This is **naive pipeline parallelism**: layers are distributed evenly, but only one GPU computes at a time as activations flow sequentially through the layer stack. The other GPUs hold their assigned layers in memory but are idle until their turn.
+
+This means multi-GPU sharding is a **memory solution, not a speed solution**. It lets you run models that don't fit on one GPU, but it won't make small models run faster. In fact, more GPUs can be *slower* due to inter-GPU data transfer overhead at layer boundaries.
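
The sequential behaviour described above can be illustrated with a toy cost model. This is a minimal sketch, not OBLITERATUS or Accelerate code: the function name, its parameters, and the timing constants are all hypothetical, and it assumes every layer costs the same and every device boundary adds a fixed transfer cost.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineEstimate:
    layers_per_gpu: List[int]   # contiguous layers assigned to each GPU
    boundaries: int             # device-to-device hops per forward pass
    seconds_per_pass: float     # modeled wall time for one forward pass


def estimate_forward_pass(num_layers: int, num_gpus: int,
                          layer_seconds: float,
                          transfer_seconds: float) -> PipelineEstimate:
    """Toy model of naive pipeline parallelism (hypothetical helper).

    Layers are split evenly into contiguous blocks. Only one GPU is busy
    at a time, so compute time is the sum over all layers no matter how
    many GPUs hold them; each extra GPU only adds one more transfer.
    """
    if num_gpus < 1 or num_layers < num_gpus:
        raise ValueError("need at least one GPU and one layer per GPU")
    base, extra = divmod(num_layers, num_gpus)
    layers_per_gpu = [base + (1 if i < extra else 0) for i in range(num_gpus)]
    boundaries = num_gpus - 1
    seconds = num_layers * layer_seconds + boundaries * transfer_seconds
    return PipelineEstimate(layers_per_gpu, boundaries, seconds)


if __name__ == "__main__":
    # Illustrative constants only; these are not measured values.
    for gpus in (1, 4, 8):
        est = estimate_forward_pass(36, gpus, layer_seconds=0.002,
                                    transfer_seconds=0.001)
        print(gpus, est.layers_per_gpu, est.seconds_per_pass)
```

Under these assumptions the modeled time per pass never decreases as GPUs are added, which is the sense in which sharding buys memory rather than speed.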
+
+### Selecting GPUs
+
+Use `--gpus` to control which GPUs are used:
+
+```bash
+# Use all available GPUs (default)
+obliteratus obliterate bigmodel/200B --gpus all
+
+# Use only GPUs 0-3
+obliteratus obliterate bigmodel/200B --gpus 0,1,2,3
+
+# Use a specific pair
+obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct --gpus 2,5
+```
+
+This sets `CUDA_VISIBLE_DEVICES` before CUDA initializes. The model is then sharded across the selected GPUs.
+
+### Pipeline parallel benchmarks
+
+We benchmarked the full abliteration pipeline on `openai/gpt-oss-120b` (117B MoE, ~234 GB in bf16) across varying numbers of A100-80GB GPUs:
+
+| GPUs | Total time | VRAM/GPU | Notes |
+|------|-----------|----------|-------|
+| 3 | **FAILED** | ~78 GB | Not enough headroom for activations; some layers offloaded to CPU as meta tensors, crashes during EXCISE |
+| 4 | **615s** (10m15s) | ~58 GB | Fastest. Fewest inter-GPU transfers. Snapshot auto-skipped (insufficient free VRAM) |
+| 5 | 763s (12m43s) | ~47 GB | 24% slower than 4 GPUs |
+| 6 | 766s (12m46s) | ~39 GB | 25% slower than 4 GPUs |
+| 8 | 633s (10m33s) | ~29 GB | 3% slower than 4 GPUs. Ran CPU-side state dict snapshot (adds ~20s) |
+
+Stage breakdown (approximately constant across GPU counts):
+
+| Stage | Time | Bottleneck |
+|-------|------|-----------|
+| SUMMON (load) | ~11s | Disk I/O (model cached locally) |
+| PROBE (activations) | ~20s | Forward passes through sharded model |
+| DISTILL + EXCISE | ~30s | SVD + weight projection (CPU-bound) |
+| VERIFY | ~210s | Forward passes on validation prompts |
+| REBIRTH (save) | ~350s | Writing 234 GB to disk |
+
+Key findings:
+
+- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. Running GPT-OSS-120B on 4 GPUs was faster than on 8.
+- **The pipeline is I/O-dominated for large models.** REBIRTH alone (writing 234 GB to disk) takes over half of the wall time, and VERIFY and REBIRTH together account for ~90%. The actual compute (PROBE, DISTILL, EXCISE) is fast regardless of GPU count.
+- **Leave headroom.** The model needs VRAM beyond just its parameter storage: activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model.
+- **Pipeline parallelism doesn't help compute-bound stages.** Since only one GPU computes at a time, doubling GPUs doesn't halve PROBE or VERIFY time. It only enables fitting larger models.
+
+### When you actually need data parallelism
+
+For models that fit on a single GPU with room to spare, the PROBE stage (which runs 1024 forward passes to collect activations) is the main computational bottleneck. Pipeline parallelism doesn't help here: it still processes one prompt at a time through the full layer stack.
+
+True data parallelism (replicating the model and splitting prompts across GPUs) can speed up PROBE, but it requires enough VRAM to hold a full copy of the model on each GPU. An experimental pre-replicated data parallel implementation is available on the `data-parallel-prereplication` branch:
+
+```bash
+git checkout data-parallel-prereplication
+obliteratus obliterate EleutherAI/pythia-12b --data-parallel
+```
+
+This deep-copies the model to each GPU once, then distributes prompt batches across replicas using a thread pool. Benchmarks on Pythia 12B (24 GB model, 8x A100-80GB):
+
+| Mode | PROBE time | Notes |
+|------|-----------|-------|
+| Single GPU | 7.1s | Baseline |
+| Pre-replicated DP (8 GPUs) | 7.7s | Near parity; PROBE is too fast at this scale for parallelism to help |
+
+Data parallelism becomes more valuable as total PROBE time grows, whether from a larger prompt count or a slower forward pass on a larger model. For most models, the overhead of replication exceeds the time saved.
+
+### Remote execution over SSH
+
+Run the full pipeline on a remote GPU node from your local machine. OBLITERATUS handles the SSH connection, auto-installs itself on the remote if needed, streams logs in real time, and copies results back when done.
+
+```bash
+# Basic remote run
+obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
+  --remote user@gpu-node
+
+# With SSH key and custom options
+obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
+  --remote root@10.0.0.5 \
+  --ssh-key ~/.ssh/id_rsa \
+  --ssh-port 2222 \
+  --remote-dir /data/obliteratus \
+  --remote-python python3.11
+
+# Don't copy results back (keep on remote only)
+obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
+  --remote user@gpu-node --no-sync
+```
+
+Remote execution also works with `obliteratus run` (YAML configs) and `obliteratus tourney` (method comparison). You can specify remote settings in YAML:
+
+```yaml
+model:
+  name: meta-llama/Llama-3.1-70B-Instruct
+  dtype: float16
+
+remote:
+  host: gpu-node
+  user: root
+  ssh_key: ~/.ssh/id_rsa
+  remote_dir: /tmp/obliteratus_run
+  gpus: "0,1,2,3"        # select GPUs on the remote
+  sync_results: true     # copy results back when done
+```
+
+The remote runner:
+1. Tests SSH connectivity
+2. Detects GPUs on the remote (`nvidia-smi`)
+3. Installs obliteratus if not already present
+4. Uploads config files if using `obliteratus run`
+5. Runs the pipeline with real-time log streaming
+6. Copies results back via SCP
+
+### Choosing the right setup
+
+| Scenario | Recommendation |
+|----------|---------------|
+| Model fits on 1 GPU | Use 1 GPU. Adding more won't help and may slow things down. |
+| Model fits on 1 GPU, PROBE is slow (many prompts) | Try the `data-parallel-prereplication` branch. Only helps if the model fits on each GPU with room for activations. |
+| Model doesn't fit on 1 GPU | Use `--gpus` with the **minimum** number of GPUs that fits. E.g., a 70B model in fp16 (~140 GB) needs 2x A100-80GB; don't use 4. |
+| Model needs 4+ GPUs | Pipeline parallel via `device_map="auto"` is the only option. Expect I/O-dominated runtimes for very large models. |
+| No local GPUs | Use `--remote user@gpu-node` to run on a remote machine, or use Hugging Face Spaces / Colab. |
+
 ## 10 study presets
+
 Pre-configured ablation studies you can run out of the box: