mirror of
https://github.com/elder-plinius/OBLITERATUS.git
synced 2026-05-12 20:02:52 +02:00
Add DeepSeek-R1-Distill-Llama-70B pipeline parallel benchmarks
Benchmarked 70B dense model (149 GB bf16) on 2/3/4/8 A100-80GB GPUs. 3 GPUs was fastest (536s), confirming minimum-viable-GPU-count guidance. Combined stage breakdown table for both models.
### Pipeline parallel benchmarks

We benchmarked the full abliteration pipeline across varying numbers of A100-80GB GPUs on two large models.
**GPT-OSS-120B** (117B MoE, ~234 GB in bf16):

| GPUs | Total time | VRAM/GPU | Notes |
|------|------------|----------|-------|
| 6 | 766s (12m46s) | ~39 GB | +25% slower than 4 GPUs |
| 8 | 633s (10m33s) | ~29 GB | +3% slower than 4 GPUs. Ran CPU-side state dict snapshot (adds ~20s) |
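The VRAM/GPU column tracks a simple even split of the weights across devices. As a back-of-envelope check (ignoring activations, KV cache, and buffers; `vram_per_gpu` is an illustrative helper, not part of the pipeline):

```python
def vram_per_gpu(model_gb: float, n_gpus: int) -> float:
    """Approximate steady-state weight memory per GPU when layers are
    sharded evenly. Ignores activations, KV cache, and buffers."""
    return model_gb / n_gpus

print(round(vram_per_gpu(234, 6)))  # -> 39, matching the ~39 GB measured
print(round(vram_per_gpu(234, 8)))  # -> 29, matching the ~29 GB measured
```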
**DeepSeek-R1-Distill-Llama-70B** (70B dense, ~149 GB in bf16, 80 layers):

| GPUs | Total time | VRAM/GPU | Notes |
|------|------------|----------|-------|
| 2 | **FAILED** | ~75 GB | Meta tensor crash — 149 GB of weights on 160 GB total VRAM leaves no activation headroom |
| 3 | **536s** (8m56s) | ~50 GB | Fastest. Minimum viable GPU count for this model |
| 4 | 626s (10m26s) | ~37 GB | +17% slower than 3 GPUs |
| 8 | 627s (10m27s) | ~19 GB | +17% slower than 3 GPUs. No benefit over 4 |
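Both tables point at the same sizing rule: total VRAM must cover the weights plus working headroom. A rough estimator, assuming an illustrative ~30% headroom fraction (a rule of thumb consistent with the failures above, not a measured constant):

```python
import math

def min_gpus(model_gb: float, gpu_gb: float = 80.0, headroom: float = 0.30) -> int:
    """Smallest GPU count whose combined VRAM fits the weights plus an
    assumed headroom fraction for activations, KV cache, and intermediates."""
    return math.ceil(model_gb * (1.0 + headroom) / gpu_gb)

print(min_gpus(234))  # -> 4 (3x80 GB = 240 GB was not enough in practice)
print(min_gpus(149))  # -> 3 (2x80 GB = 160 GB crashed)
```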
Stage breakdown (approximately constant across GPU counts):

| Stage | GPT-OSS-120B | DeepSeek-70B | Bottleneck |
|-------|--------------|--------------|------------|
| SUMMON (load) | ~11s | ~24s | Disk I/O (model cached locally) |
| PROBE (activations) | ~20s | ~20s | Forward passes through sharded model |
| DISTILL + EXCISE | ~30s | ~30s | SVD + weight projection (CPU-bound) |
| VERIFY | ~210s | ~270s | Forward passes on validation prompts |
| REBIRTH (save) | ~350s | ~194s | Writing the model to disk (234 GB vs 141 GB) |
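A per-stage breakdown like the one above can be collected with a small timing context manager. A minimal sketch (the stage names mirror the table; the helper itself is illustrative, not the pipeline's actual instrumentation):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    """Record wall-clock time for one named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# The real pipeline would wrap SUMMON/PROBE/DISTILL/EXCISE/VERIFY/REBIRTH.
with stage("SUMMON"):
    pass  # model loading goes here
with stage("VERIFY"):
    pass  # validation forward passes go here

print({name: f"{t:.1f}s" for name, t in timings.items()})
```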
Key findings:

- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B; 3 GPUs was fastest for DeepSeek-70B.
- **The pipeline is I/O-dominated for large models.** VERIFY and REBIRTH together account for ~90% of wall time. The actual compute (PROBE, DISTILL, EXCISE) is fast regardless of GPU count.
- **Leave headroom.** The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model; 2x A100-80GB (160 GB) was not enough for a 149 GB model.
- **Pipeline parallelism doesn't help compute-bound stages.** Since only one GPU computes at a time, doubling GPUs doesn't halve PROBE or VERIFY time. It only enables fitting larger models.
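Per the first finding, the GPU count can be pinned by masking devices before anything initializes CUDA (the docs note the pipeline sets `CUDA_VISIBLE_DEVICES` for exactly this reason). A minimal sketch; `select_gpus` is an illustrative helper, not part of the tool:

```python
import os

def select_gpus(n: int) -> None:
    """Expose only the first n GPUs to this process. Must run before CUDA
    initializes (i.e. before importing anything that touches torch.cuda),
    or the mask is silently ignored."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(n))

select_gpus(3)  # e.g. the minimum viable count for DeepSeek-70B
print(os.environ["CUDA_VISIBLE_DEVICES"])  # -> 0,1,2
```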
### When you actually need data parallelism