From c723da02c84194f14078675d2b777996076ff2df Mon Sep 17 00:00:00 2001
From: Stella Biderman <stellabiderman@gmail.com>
Date: Mon, 16 Mar 2026 14:39:22 -0400
Subject: [PATCH] Document multi-GPU parallelism, benchmarks, and remote SSH
 execution

Add a comprehensive section covering:
- How model sharding (pipeline parallelism) works and its limitations
- GPU selection via --gpus flag
- Pipeline parallel benchmarks on GPT-OSS-120B across 3-8 A100-80GB GPUs
- Stage-by-stage timing breakdown
- When data parallelism helps (and when it doesn't)
- Remote SSH execution with CLI and YAML examples
- Decision table for choosing the right setup
---
 README.md | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)

diff --git a/README.md b/README.md
index 045ae5f..008cde1 100644
--- a/README.md
+++ b/README.md
@@ -415,6 +415,138 @@ Includes pre-liberated variants (Dolphin, Hermes, WhiteRabbitNeo) for A/B compar
 obliteratus models
 ```
 
+## Multi-GPU and remote execution
+
+OBLITERATUS automatically shards models across multiple GPUs when they don't fit on a single card. It also supports remote execution over SSH, so you can run the pipeline on a GPU server from your laptop.
+
+### How model sharding works
+
+When you have multiple GPUs, OBLITERATUS uses accelerate's `device_map="auto"` to split the model's layers across all available GPUs. This is **naive pipeline parallelism** — layers are distributed evenly, but only one GPU computes at a time as activations flow sequentially through the layer stack. The other GPUs hold their assigned layers in memory but are idle until their turn.
+
+This means multi-GPU sharding is a **memory solution, not a speed solution**. It lets you run models that don't fit on one GPU, but it won't make small models run faster. In fact, more GPUs can be *slower* due to inter-GPU data transfer overhead at layer boundaries.
+
+### Selecting GPUs
+
+Use `--gpus` to control which GPUs are used:
+
+```bash
+# Use all 8 GPUs (default)
+obliteratus obliterate bigmodel/200B --gpus all
+
+# Use only GPUs 0-3
+obliteratus obliterate bigmodel/200B --gpus 0,1,2,3
+
+# Use a specific pair
+obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct --gpus 2,5
+```
+
+This sets `CUDA_VISIBLE_DEVICES` before CUDA initializes. The model is then sharded across the selected GPUs.
+
+### Pipeline parallel benchmarks
+
+We benchmarked the full abliteration pipeline on `openai/gpt-oss-120b` (117B MoE, ~234 GB in bf16) across varying numbers of A100-80GB GPUs:
+
+| GPUs | Total time | VRAM/GPU | Notes |
+|------|-----------|----------|-------|
+| 3 | **FAILED** | ~78 GB | Not enough headroom for activations; some layers offloaded to CPU as meta tensors, crashes during EXCISE |
+| 4 | **615s** (10m15s) | ~58 GB | Fastest. Fewest inter-GPU transfers. Snapshot auto-skipped (insufficient free VRAM) |
+| 5 | 763s (12m43s) | ~47 GB | +24% slower than 4 GPUs |
+| 6 | 766s (12m46s) | ~39 GB | +25% slower than 4 GPUs |
+| 8 | 633s (10m33s) | ~29 GB | +3% slower than 4 GPUs. Ran CPU-side state dict snapshot (adds ~20s) |
+
+Stage breakdown (approximately constant across GPU counts):
+
+| Stage | Time | Bottleneck |
+|-------|------|-----------|
+| SUMMON (load) | ~11s | Disk I/O (model cached locally) |
+| PROBE (activations) | ~20s | Forward passes through sharded model |
+| DISTILL + EXCISE | ~30s | SVD + weight projection (CPU-bound) |
+| VERIFY | ~210s | Forward passes on validation prompts |
+| REBIRTH (save) | ~350s | Writing 234 GB to disk |
+
+Key findings:
+
+- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B.
+- **The pipeline is I/O-dominated for large models.** VERIFY and REBIRTH together account for ~90% of wall time. The actual compute (PROBE, DISTILL, EXCISE) is fast regardless of GPU count.
+- **Leave headroom.** The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model.
+- **Pipeline parallelism doesn't help compute-bound stages.** Since only one GPU computes at a time, doubling GPUs doesn't halve PROBE or VERIFY time. It only enables fitting larger models.
+
+### When you actually need data parallelism
+
+For models that fit on a single GPU with room to spare, the PROBE stage (which runs 1024 forward passes to collect activations) is the main computational bottleneck. Pipeline parallelism doesn't help here — it still processes one prompt at a time through the full layer stack.
+
+True data parallelism (replicating the model and splitting prompts across GPUs) can speed up PROBE, but it requires enough VRAM to hold a full copy of the model on each GPU. An experimental pre-replicated data parallel implementation is available on the `data-parallel-prereplication` branch:
+
+```bash
+git checkout data-parallel-prereplication
+obliteratus obliterate EleutherAI/pythia-12b --data-parallel
+```
+
+This deep-copies the model to each GPU once, then distributes prompt batches across replicas using a thread pool. Benchmarks on Pythia 12B (24 GB model, 8x A100-80GB):
+
+| Mode | PROBE time | Notes |
+|------|-----------|-------|
+| Single GPU | 7.1s | Baseline |
+| Pre-replicated DP (8 GPUs) | 7.7s | Near parity — PROBE is too fast at this scale for parallelism to help |
+
+Data parallelism becomes more valuable as the prompt count or model size increases relative to the per-forward-pass cost. For most models, the overhead of replication exceeds the time saved.
+
+### Remote execution over SSH
+
+Run the full pipeline on a remote GPU node from your local machine. OBLITERATUS handles SSH connection, auto-installs itself on the remote if needed, streams logs in real time, and copies results back when done.
+
+```bash
+# Basic remote run
+obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
+    --remote user@gpu-node
+
+# With SSH key and custom options
+obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
+    --remote root@10.0.0.5 \
+    --ssh-key ~/.ssh/id_rsa \
+    --ssh-port 2222 \
+    --remote-dir /data/obliteratus \
+    --remote-python python3.11
+
+# Don't copy results back (keep on remote only)
+obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
+    --remote user@gpu-node --no-sync
+```
+
+Remote execution also works with `obliteratus run` (YAML configs) and `obliteratus tourney` (method comparison). You can specify remote settings in YAML:
+
+```yaml
+model:
+  name: meta-llama/Llama-3.1-70B-Instruct
+  dtype: float16
+
+remote:
+  host: gpu-node
+  user: root
+  ssh_key: ~/.ssh/id_rsa
+  remote_dir: /tmp/obliteratus_run
+  gpus: "0,1,2,3"     # select GPUs on the remote
+  sync_results: true   # copy results back when done
+```
+
+The remote runner:
+1. Tests SSH connectivity
+2. Detects GPUs on the remote (`nvidia-smi`)
+3. Installs obliteratus if not already present
+4. Uploads config files if using `obliteratus run`
+5. Runs the pipeline with real-time log streaming
+6. Copies results back via SCP
+
+### Choosing the right setup
+
+| Scenario | Recommendation |
+|----------|---------------|
+| Model fits on 1 GPU | Use 1 GPU. Adding more won't help and may slow things down. |
+| Model fits on 1 GPU, PROBE is slow (many prompts) | Try `data-parallel-prereplication` branch. Only helps if model fits on each GPU with room for activations. |
+| Model doesn't fit on 1 GPU | Use `--gpus` with the **minimum** number of GPUs that fits. E.g., a 70B model in fp16 (~140 GB) needs 2x A100-80GB — don't use 4. |
+| Model needs 4+ GPUs | Pipeline parallel via `device_map="auto"` is the only option. Expect I/O-dominated runtimes for very large models. |
+| No local GPUs | Use `--remote user@gpu-node` to run on a remote machine, or use HuggingFace Spaces / Colab. |
+
 ## 10 study presets
 
 Pre-configured ablation studies you can run out of the box: