From 79b469d3dcab86e44cb2775c6b4ebe3197d7e746 Mon Sep 17 00:00:00 2001
From: Stella Biderman
Date: Mon, 16 Mar 2026 15:29:57 -0400
Subject: [PATCH] Add DeepSeek-R1-Distill-Llama-70B pipeline parallel benchmarks

Benchmarked 70B dense model (~141 GB bf16) on 2/3/4/8 A100-80GB GPUs.
3 GPUs was fastest (536s), confirming minimum-viable-GPU-count guidance.
Combined stage breakdown table for both models.
---
 README.md | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 008cde1..275be60 100644
--- a/README.md
+++ b/README.md
@@ -444,7 +444,9 @@ This sets `CUDA_VISIBLE_DEVICES` before CUDA initializes. The model is then shar
 
 ### Pipeline parallel benchmarks
 
-We benchmarked the full abliteration pipeline on `openai/gpt-oss-120b` (117B MoE, ~234 GB in bf16) across varying numbers of A100-80GB GPUs:
+We benchmarked the full abliteration pipeline across varying numbers of A100-80GB GPUs on two large models.
+
+**GPT-OSS-120B** (117B MoE, ~234 GB in bf16):
 
 | GPUs | Total time | VRAM/GPU | Notes |
 |------|-----------|----------|-------|
@@ -454,21 +456,30 @@
 | 6 | 766s (12m46s) | ~39 GB | +25% slower than 4 GPUs |
 | 8 | 633s (10m33s) | ~29 GB | +3% slower than 4 GPUs. Run included a CPU-side state-dict snapshot (adds ~20s) |
 
+**DeepSeek-R1-Distill-Llama-70B** (70B dense, ~141 GB in bf16, 80 layers):
+
+| GPUs | Total time | VRAM/GPU | Notes |
+|------|-----------|----------|-------|
+| 2 | **FAILED** | ~75 GB | Meta tensor crash — a 141 GB model on 160 GB of total VRAM leaves too little activation headroom |
+| 3 | **536s** (8m56s) | ~50 GB | Fastest. Minimum viable GPU count for this model |
+| 4 | 626s (10m26s) | ~37 GB | +17% slower than 3 GPUs |
+| 8 | 627s (10m27s) | ~19 GB | +17% slower than 3 GPUs. No benefit over 4 |
+
 Stage breakdown (approximately constant across GPU counts):
 
-| Stage | Time | Bottleneck |
-|-------|------|-----------|
-| SUMMON (load) | ~11s | Disk I/O (model cached locally) |
-| PROBE (activations) | ~20s | Forward passes through sharded model |
-| DISTILL + EXCISE | ~30s | SVD + weight projection (CPU-bound) |
-| VERIFY | ~210s | Forward passes on validation prompts |
-| REBIRTH (save) | ~350s | Writing 234 GB to disk |
+| Stage | GPT-OSS-120B | DeepSeek-70B | Bottleneck |
+|-------|-------------|-------------|-----------|
+| SUMMON (load) | ~11s | ~24s | Disk I/O (model cached locally) |
+| PROBE (activations) | ~20s | ~20s | Forward passes through sharded model |
+| DISTILL + EXCISE | ~30s | ~30s | SVD + weight projection (CPU-bound) |
+| VERIFY | ~210s | ~270s | Forward passes on validation prompts |
+| REBIRTH (save) | ~350s | ~194s | Writing model to disk (234 GB vs 141 GB) |
 
 Key findings:
 
-- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B.
+- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B; 3 GPUs was fastest for DeepSeek-70B.
 - **The pipeline is I/O-dominated for large models.** VERIFY and REBIRTH together account for ~90% of wall time. The actual compute (PROBE, DISTILL, EXCISE) is fast regardless of GPU count.
-- **Leave headroom.** The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model.
+- **Leave headroom.** The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model; 2x A100-80GB (160 GB) was not enough for a 141 GB model (see the sizing sketch below).
 - **Pipeline parallelism doesn't help compute-bound stages.** Since only one GPU computes at a time, doubling GPUs doesn't halve PROBE or VERIFY time. It only enables fitting larger models.
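+
+A back-of-envelope sizing sketch for the first and third findings (illustrative, not part of the pipeline; the 15% activation-headroom fraction is our assumption, loosely fitted to the two failed runs above):
+
+```python
+# Smallest A100-80GB count whose usable VRAM holds the bf16 weights.
+# The 0.85 usable fraction is an assumed headroom rule, not a measured
+# constant of this pipeline.
+import math
+
+GPU_VRAM_GB = 80
+USABLE_FRACTION = 0.85
+
+def bf16_size_gb(params_billion: float) -> float:
+    return params_billion * 2.0  # bf16 stores 2 bytes per parameter
+
+def min_gpus(params_billion: float) -> int:
+    usable_per_gpu = GPU_VRAM_GB * USABLE_FRACTION
+    return math.ceil(bf16_size_gb(params_billion) / usable_per_gpu)
+
+print(min_gpus(117))   # GPT-OSS-120B: 234 GB -> 4 GPUs (3 GPUs failed)
+print(min_gpus(70.6))  # DeepSeek-70B: ~141 GB -> 3 GPUs (2 GPUs crashed)
+```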
 
 ### When you actually need data parallelism