From 79b469d3dcab86e44cb2775c6b4ebe3197d7e746 Mon Sep 17 00:00:00 2001
From: Stella Biderman
Date: Mon, 16 Mar 2026 15:29:57 -0400
Subject: [PATCH] Add DeepSeek-R1-Distill-Llama-70B pipeline parallel benchmarks

Benchmarked 70B dense model (~141 GB bf16) on 2/3/4/8 A100-80GB GPUs.
3 GPUs was fastest (536s), confirming minimum-viable-GPU-count guidance.
Combined stage breakdown table for both models.
---
 README.md | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 008cde1..275be60 100644
--- a/README.md
+++ b/README.md
@@ -444,7 +444,9 @@ This sets `CUDA_VISIBLE_DEVICES` before CUDA initializes. The model is then shar
 
 ### Pipeline parallel benchmarks
 
-We benchmarked the full abliteration pipeline on `openai/gpt-oss-120b` (117B MoE, ~234 GB in bf16) across varying numbers of A100-80GB GPUs:
+We benchmarked the full abliteration pipeline across varying numbers of A100-80GB GPUs on two large models.
+
+**GPT-OSS-120B** (117B MoE, ~234 GB in bf16):
 
 | GPUs | Total time | VRAM/GPU | Notes |
 |------|-----------|----------|-------|
@@ -454,21 +456,30 @@
 | 6 | 766s (12m46s) | ~39 GB | +25% slower than 4 GPUs |
 | 8 | 633s (10m33s) | ~29 GB | +3% slower than 4 GPUs. Run included a CPU-side state-dict snapshot (adds ~20s) |
 
+**DeepSeek-R1-Distill-Llama-70B** (70B dense, ~141 GB in bf16, 80 layers):
+
+| GPUs | Total time | VRAM/GPU | Notes |
+|------|-----------|----------|-------|
+| 2 | **FAILED** | ~75 GB | Meta tensor crash — a 141 GB model on 160 GB of total VRAM leaves too little activation headroom |
+| 3 | **536s** (8m56s) | ~50 GB | Fastest. Minimum viable GPU count for this model |
+| 4 | 626s (10m26s) | ~37 GB | +17% slower than 3 GPUs |
+| 8 | 627s (10m27s) | ~19 GB | +17% slower than 3 GPUs. No benefit over 4 |
+
 Stage breakdown (approximately constant across GPU counts):
 
-| Stage | Time | Bottleneck |
-|-------|------|-----------|
-| SUMMON (load) | ~11s | Disk I/O (model cached locally) |
-| PROBE (activations) | ~20s | Forward passes through sharded model |
-| DISTILL + EXCISE | ~30s | SVD + weight projection (CPU-bound) |
-| VERIFY | ~210s | Forward passes on validation prompts |
-| REBIRTH (save) | ~350s | Writing 234 GB to disk |
+| Stage | GPT-OSS-120B | DeepSeek-70B | Bottleneck |
+|-------|-------------|-------------|-----------|
+| SUMMON (load) | ~11s | ~24s | Disk I/O (model cached locally) |
+| PROBE (activations) | ~20s | ~20s | Forward passes through sharded model |
+| DISTILL + EXCISE | ~30s | ~30s | SVD + weight projection (CPU-bound) |
+| VERIFY | ~210s | ~270s | Forward passes on validation prompts |
+| REBIRTH (save) | ~350s | ~194s | Writing model to disk (234 GB vs 141 GB) |
 
 Key findings:
 
-- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B.
+- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B; 3 GPUs was fastest for DeepSeek-70B.
 - **The pipeline is I/O-dominated for large models.** VERIFY and REBIRTH together account for ~90% of wall time. The actual compute (PROBE, DISTILL, EXCISE) is fast regardless of GPU count.
-- **Leave headroom.** The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model.
+- **Leave headroom.** The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model; 2x A100-80GB (160 GB) was not enough for a 141 GB model (see the sizing sketch below).
 - **Pipeline parallelism doesn't help compute-bound stages.** Since only one GPU computes at a time, doubling GPUs doesn't halve PROBE or VERIFY time. It only enables fitting larger models.
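+
+A back-of-envelope sizing sketch for the first and third findings (illustrative, not part of the pipeline; the 15% activation-headroom fraction is our assumption, loosely fitted to the two failed runs above):
+
+```python
+# Smallest A100-80GB count whose usable VRAM holds the bf16 weights.
+# The 0.85 usable fraction is an assumed headroom rule, not a measured
+# constant of this pipeline.
+import math
+
+GPU_VRAM_GB = 80
+USABLE_FRACTION = 0.85
+
+def bf16_size_gb(params_billion: float) -> float:
+    return params_billion * 2.0  # bf16 stores 2 bytes per parameter
+
+def min_gpus(params_billion: float) -> int:
+    usable_per_gpu = GPU_VRAM_GB * USABLE_FRACTION
+    return math.ceil(bf16_size_gb(params_billion) / usable_per_gpu)
+
+print(min_gpus(117))   # GPT-OSS-120B: 234 GB -> 4 GPUs (3 GPUs failed)
+print(min_gpus(70.6))  # DeepSeek-70B: ~141 GB -> 3 GPUs (2 GPUs crashed)
+```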
 
 ### When you actually need data parallelism