Add DeepSeek-R1-Distill-Llama-70B pipeline parallel benchmarks

Benchmarked 70B dense model (149 GB bf16) on 2/3/4/8 A100-80GB GPUs.
3 GPUs was fastest (536s), confirming minimum-viable-GPU-count guidance.
Combined the stage breakdown into a single table covering both models.
Stella Biderman
2026-03-16 15:29:57 -04:00
parent c723da02c8
commit 79b469d3dc
+21 -10
@@ -444,7 +444,9 @@ This sets `CUDA_VISIBLE_DEVICES` before CUDA initializes. The model is then shar
### Pipeline parallel benchmarks
We benchmarked the full abliteration pipeline on `openai/gpt-oss-120b` (117B MoE, ~234 GB in bf16) across varying numbers of A100-80GB GPUs:
We benchmarked the full abliteration pipeline across varying numbers of A100-80GB GPUs on two large models.
**GPT-OSS-120B** (117B MoE, ~234 GB in bf16):
| GPUs | Total time | VRAM/GPU | Notes |
|------|-----------|----------|-------|
@@ -454,21 +456,30 @@ We benchmarked the full abliteration pipeline on `openai/gpt-oss-120b` (117B MoE
| 6 | 766s (12m46s) | ~39 GB | +25% slower than 4 GPUs |
| 8 | 633s (10m33s) | ~29 GB | +3% slower than 4 GPUs. Ran CPU-side state dict snapshot (adds ~20s) |
**DeepSeek-R1-Distill-Llama-70B** (70B dense, ~149 GB in bf16, 80 layers):
| GPUs | Total time | VRAM/GPU | Notes |
|------|-----------|----------|-------|
| 2 | **FAILED** | ~75 GB | Meta tensor crash — 149 GB model on 160 GB total VRAM leaves no activation headroom |
| 3 | **536s** (8m56s) | ~50 GB | Fastest. Minimum viable GPU count for this model |
| 4 | 626s (10m26s) | ~37 GB | +17% slower than 3 GPUs |
| 8 | 627s (10m27s) | ~19 GB | +17% slower than 3 GPUs. No benefit over 4 |
Stage breakdown (approximately constant across GPU counts):
| Stage | Time | Bottleneck |
|-------|------|-----------|
| SUMMON (load) | ~11s | Disk I/O (model cached locally) |
| PROBE (activations) | ~20s | Forward passes through sharded model |
| DISTILL + EXCISE | ~30s | SVD + weight projection (CPU-bound) |
| VERIFY | ~210s | Forward passes on validation prompts |
| REBIRTH (save) | ~350s | Writing 234 GB to disk |
| Stage | GPT-OSS-120B | DeepSeek-70B | Bottleneck |
|-------|-------------|-------------|-----------|
| SUMMON (load) | ~11s | ~24s | Disk I/O (model cached locally) |
| PROBE (activations) | ~20s | ~20s | Forward passes through sharded model |
| DISTILL + EXCISE | ~30s | ~30s | SVD + weight projection (CPU-bound) |
| VERIFY | ~210s | ~270s | Forward passes on validation prompts |
| REBIRTH (save) | ~350s | ~194s | Writing model to disk (234 GB vs 141 GB) |
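The DISTILL + EXCISE row refers to distilling a direction from the probed activations (via SVD) and projecting it out of the weights. As a rough illustration only, here is a minimal sketch of that kind of rank-1 weight projection, assuming the common abliteration recipe; the function name and the single `refusal_dir` input are illustrative, not taken from this repository's implementation.

```python
# Minimal sketch of a rank-1 "excise" projection (common abliteration
# recipe, assumed here -- not lifted from this repo's DISTILL/EXCISE code).
import torch

def excise_direction(weight: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Return W' = (I - d d^T) W, removing the component along refusal_dir.

    weight: (d_out, d_in) matrix whose outputs live in the d_out space.
    refusal_dir: (d_out,) direction distilled from probed activations.
    """
    d = refusal_dir / refusal_dir.norm()
    return weight - torch.outer(d, d @ weight)
```

Because this is plain matrix arithmetic over each weight tensor, it stays CPU-bound and takes roughly the same time no matter how many GPUs hold the sharded model, which is why the stage times above are approximately constant across GPU counts.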
Key findings:
- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B.
- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B; 3 GPUs was fastest for DeepSeek-70B (a sizing sketch follows this list).
- **The pipeline is I/O-dominated for large models.** VERIFY and REBIRTH together account for ~90% of wall time (for GPT-OSS-120B, roughly 560s of a ~630s run). The actual compute (PROBE, DISTILL, EXCISE) is fast regardless of GPU count.
- **Leave headroom.** The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model.
- **Leave headroom.** The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model; 2x A100-80GB (160 GB) was not enough for a 149 GB model.
- **Pipeline parallelism doesn't help compute-bound stages.** Since only one GPU computes at a time, doubling GPUs doesn't halve PROBE or VERIFY time. It only enables fitting larger models.
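As a rough illustration of the two sizing rules above (minimum GPU count, plus headroom), here is a minimal sketch of how a launch script might pick and pin GPUs before loading the model. The 1.3x headroom factor and the model name are assumptions for the example, not values used by this pipeline.

```python
import math
import os

# Illustrative values -- not taken from this repo's pipeline.
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
WEIGHT_GB = 149   # bf16 weights, from the benchmark table above
HEADROOM = 1.3    # assumed multiplier for activations / KV cache
GPU_GB = 80       # A100-80GB

# Minimum GPU count that fits weights plus headroom: 3 for this model.
n_gpus = math.ceil(WEIGHT_GB * HEADROOM / GPU_GB)

# Pin the GPUs *before* torch is imported, so CUDA only ever sees them.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(n_gpus))

import torch
from transformers import AutoModelForCausalLM

# device_map="auto" shards the layers across the visible GPUs.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

Setting `CUDA_VISIBLE_DEVICES` before `torch` is imported mirrors the behavior described at the top of this hunk, and `device_map="auto"` shards layers across the visible GPUs so only one GPU computes at a time, which is why GPUs beyond the minimum only add transfer overhead.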
### When you actually need data parallelism