Add DeepSeek-R1-Distill-Llama-70B pipeline parallel benchmarks

Benchmarked 70B dense model (149 GB bf16) on 2/3/4/8 A100-80GB GPUs.
3 GPUs was fastest (536s), confirming minimum-viable-GPU-count guidance.
Combined the stage breakdown into a single table covering both models.
Stella Biderman
2026-03-16 15:29:57 -04:00
parent c723da02c8
commit 79b469d3dc
+21 -10
@@ -444,7 +444,9 @@ This sets `CUDA_VISIBLE_DEVICES` before CUDA initializes. The model is then shar
### Pipeline parallel benchmarks
We benchmarked the full abliteration pipeline on `openai/gpt-oss-120b` (117B MoE, ~234 GB in bf16) across varying numbers of A100-80GB GPUs:
We benchmarked the full abliteration pipeline across varying numbers of A100-80GB GPUs on two large models.
**GPT-OSS-120B** (117B MoE, ~234 GB in bf16):
| GPUs | Total time | VRAM/GPU | Notes |
|------|-----------|----------|-------|
@@ -454,21 +456,30 @@ We benchmarked the full abliteration pipeline on `openai/gpt-oss-120b` (117B MoE
| 6 | 766s (12m46s) | ~39 GB | +25% slower than 4 GPUs |
| 8 | 633s (10m33s) | ~29 GB | +3% slower than 4 GPUs. Ran CPU-side state dict snapshot (adds ~20s) |
**DeepSeek-R1-Distill-Llama-70B** (70B dense, ~149 GB in bf16, 80 layers):
| GPUs | Total time | VRAM/GPU | Notes |
|------|-----------|----------|-------|
| 2 | **FAILED** | ~75 GB | Meta tensor crash — 149 GB model on 160 GB total VRAM leaves no activation headroom |
| 3 | **536s** (8m56s) | ~50 GB | Fastest. Minimum viable GPU count for this model |
| 4 | 626s (10m26s) | ~37 GB | +17% slower than 3 GPUs |
| 8 | 627s (10m27s) | ~19 GB | +17% slower than 3 GPUs. No benefit over 4 |
Stage breakdown (approximately constant across GPU counts):
| Stage | Time | Bottleneck |
|-------|------|-----------|
| SUMMON (load) | ~11s | Disk I/O (model cached locally) |
| PROBE (activations) | ~20s | Forward passes through sharded model |
| DISTILL + EXCISE | ~30s | SVD + weight projection (CPU-bound) |
| VERIFY | ~210s | Forward passes on validation prompts |
| REBIRTH (save) | ~350s | Writing 234 GB to disk |
| Stage | GPT-OSS-120B | DeepSeek-70B | Bottleneck |
|-------|-------------|-------------|-----------|
| SUMMON (load) | ~11s | ~24s | Disk I/O (model cached locally) |
| PROBE (activations) | ~20s | ~20s | Forward passes through sharded model |
| DISTILL + EXCISE | ~30s | ~30s | SVD + weight projection (CPU-bound) |
| VERIFY | ~210s | ~270s | Forward passes on validation prompts |
| REBIRTH (save) | ~350s | ~194s | Writing model to disk (234 GB vs 141 GB) |
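The DISTILL + EXCISE row refers to distilling a direction from the probed activations (via SVD) and projecting it out of the weights. As a rough illustration only, here is a minimal sketch of that kind of rank-1 weight projection, assuming the common abliteration recipe; the function name and the single `refusal_dir` input are illustrative, not taken from this repository's implementation.

```python
# Minimal sketch of a rank-1 "excise" projection (common abliteration
# recipe, assumed here -- not lifted from this repo's DISTILL/EXCISE code).
import torch

def excise_direction(weight: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Return W' = (I - d d^T) W, removing the component along refusal_dir.

    weight: (d_out, d_in) matrix whose outputs live in the d_out space.
    refusal_dir: (d_out,) direction distilled from probed activations.
    """
    d = refusal_dir / refusal_dir.norm()
    return weight - torch.outer(d, d @ weight)
```

Because this is plain matrix arithmetic over each weight tensor, it stays CPU-bound and takes roughly the same time no matter how many GPUs hold the sharded model, which is why the stage times above are approximately constant across GPU counts.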
Key findings:
- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B.
- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B; 3 GPUs was fastest for DeepSeek-70B (a sizing sketch follows this list).
- **The pipeline is I/O-dominated for large models.** VERIFY and REBIRTH together account for ~90% of wall time (for GPT-OSS-120B, roughly 560s of a ~630s run). The actual compute (PROBE, DISTILL, EXCISE) is fast regardless of GPU count.
- **Leave headroom.** The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model.
- **Leave headroom.** The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model; 2x A100-80GB (160 GB) was not enough for a 149 GB model.
- **Pipeline parallelism doesn't help compute-bound stages.** Since only one GPU computes at a time, doubling GPUs doesn't halve PROBE or VERIFY time. It only enables fitting larger models.
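As a rough illustration of the two sizing rules above (minimum GPU count, plus headroom), here is a minimal sketch of how a launch script might pick and pin GPUs before loading the model. The 1.3x headroom factor and the model name are assumptions for the example, not values used by this pipeline.

```python
import math
import os

# Illustrative values -- not taken from this repo's pipeline.
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
WEIGHT_GB = 149   # bf16 weights, from the benchmark table above
HEADROOM = 1.3    # assumed multiplier for activations / KV cache
GPU_GB = 80       # A100-80GB

# Minimum GPU count that fits weights plus headroom: 3 for this model.
n_gpus = math.ceil(WEIGHT_GB * HEADROOM / GPU_GB)

# Pin the GPUs *before* torch is imported, so CUDA only ever sees them.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(n_gpus))

import torch
from transformers import AutoModelForCausalLM

# device_map="auto" shards the layers across the visible GPUs.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

Setting `CUDA_VISIBLE_DEVICES` before `torch` is imported mirrors the behavior described at the top of this hunk, and `device_map="auto"` shards layers across the visible GPUs so only one GPU computes at a time, which is why GPUs beyond the minimum only add transfer overhead.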
### When you actually need data parallelism