Add gpu-calc command and document precision/quantization options

New `obliteratus gpu-calc` subcommand estimates minimum GPU count from
model params, dtype, and GPU VRAM. Auto-detects param counts from HF
configs including MoE expert structure.

README now covers --dtype, --quantization flags, the gpu-calc command,
and references both in the "Choosing the right setup" table.
Stella Biderman
2026-03-17 14:01:18 -04:00
parent 79b469d3dc
commit 501ff0c963
2 changed files with 287 additions and 3 deletions
+47 -2
@@ -442,6 +442,49 @@ obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct --gpus 2,5
This sets `CUDA_VISIBLE_DEVICES` before CUDA initializes. The model is then sharded across the selected GPUs.
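The ordering constraint above can be illustrated with a minimal sketch. The helper name `select_gpus` is hypothetical and is not the project's `_apply_gpu_selection`, whose implementation is not shown here; the sketch only assumes that `--gpus` carries a comma-separated list of device indices.

```python
import os


def select_gpus(gpus: str) -> list[int]:
    """Restrict this process to the given comma-separated GPU indices.

    Hypothetical helper for illustration only.
    """
    # Validate early so a typo such as "2,,5" fails before any CUDA work starts.
    indices = [int(part) for part in gpus.split(",")]
    if any(i < 0 for i in indices):
        raise ValueError(f"GPU indices must be non-negative: {gpus!r}")
    # This must happen before the first CUDA call in the process;
    # changing the variable after CUDA has initialized has no effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return indices


selected = select_gpus("2,5")
```

Inside the process the selected devices are renumbered from zero, so physical GPUs 2 and 5 appear as `cuda:0` and `cuda:1`.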
### Precision and quantization
The `--dtype` flag controls the precision of model weights, which directly determines how much VRAM you need. Lower precision means smaller memory footprint at the cost of some numerical fidelity:
| Dtype | Bytes/param | 7B model | 70B model | 405B model |
|-------|-----------|---------|----------|-----------|
| `float32` | 4 | 28 GB | 280 GB | 1620 GB |
| `float16` / `bfloat16` | 2 | 14 GB | 140 GB | 810 GB |
| `int8` (via `--quantization bitsandbytes-8bit`) | 1 | 7 GB | 70 GB | 405 GB |
| `int4` (via `--quantization bitsandbytes-4bit`) | 0.5 | 3.5 GB | 35 GB | 203 GB |
```bash
# Default: bfloat16
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct
# 8-bit quantization — fits on fewer GPUs
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
--quantization bitsandbytes-8bit
# 4-bit quantization — Llama-405B on 4x A100-80GB
obliteratus obliterate meta-llama/Llama-3.1-405B-Instruct \
--quantization bitsandbytes-4bit --dtype float16
```
Each step down in precision halves weight memory, which usually lowers the GPU count as well: a 70B model that needs 3x A100-80GB in bf16 fits on 2 in int8 or 1 in int4.
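The figures in the table are plain arithmetic: parameters in billions times bytes per parameter gives decimal gigabytes. A small sketch, where the helper name `weight_gb` is an illustrative assumption and not part of the CLI:

```python
# Bytes per parameter for each precision listed in the table above.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_gb(params_billions: float, dtype: str) -> float:
    """Weight memory in decimal GB (1e9 params * bytes/param / 1e9 bytes per GB)."""
    return params_billions * BYTES_PER_PARAM[dtype]


# Recompute the table rows for the 7B, 70B and 405B columns.
for dtype in ("float32", "bfloat16", "int8", "int4"):
    print(dtype, [weight_gb(p, dtype) for p in (7, 70, 405)])
```

This covers weights only. Activations and per-GPU CUDA overhead come on top, which is why a 140 GB bf16 model does not fit on two 80 GB cards.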
### GPU calculator
Not sure how many GPUs you need? The `gpu-calc` command estimates the minimum GPU count for any model, accounting for weight memory, activation overhead, and CUDA context:
```bash
# Auto-detect from HuggingFace model name
obliteratus gpu-calc meta-llama/Llama-3.1-70B-Instruct --gpu-mem 24
# Manual: specify params and precision
obliteratus gpu-calc --params 70 --dtype bfloat16 --gpu-mem 80
# MoE models: specify active params separately
obliteratus gpu-calc --params 117 --active-params 13 --dtype bfloat16 --gpu-mem 80
```
When given a model name, the calculator fetches the model config from HuggingFace to estimate parameter counts (including MoE expert structure); passing `--params` overrides this auto-detection. It then shows a table of GPU configurations with headroom estimates. For MoE models, activation overhead is computed from the active parameter count rather than total parameters.
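The estimate itself reduces to a few lines. The sketch below mirrors the constants used by `_cmd_gpu_calc` in this commit (activation overhead of 20% of active-weight memory, 1.5 GB of CUDA overhead per GPU); the function name `estimate_min_gpus` and its signature are illustrative assumptions, not part of the CLI.

```python
import math


def estimate_min_gpus(
    total_params_b: float,
    active_params_b: float,
    bytes_per_param: float,
    gpu_mem_gb: float,
    activation_frac: float = 0.20,   # share of active-weight memory
    cuda_overhead_gb: float = 1.5,   # fixed cost per GPU
) -> int:
    """Minimum GPU count for weights plus activation overhead."""
    weight_gb = total_params_b * bytes_per_param
    # Activations scale with the parameters used per forward pass,
    # so MoE models are charged for active params only.
    activation_gb = active_params_b * bytes_per_param * activation_frac
    usable_per_gpu = gpu_mem_gb - cuda_overhead_gb
    if usable_per_gpu <= 0:
        raise ValueError("GPU memory too small after CUDA overhead")
    return max(1, math.ceil((weight_gb + activation_gb) / usable_per_gpu))


# Dense 70B model on 80 GB GPUs at three precisions.
dense = [estimate_min_gpus(70, 70, bpp, 80) for bpp in (2, 1, 0.5)]
# MoE example from the commands above: 117B total, 13B active, bf16.
moe = estimate_min_gpus(117, 13, 2, 80)
print(dense, moe)
```

For the dense 70B case this yields 3, 2 and 1 GPUs for bf16, int8 and int4, and 4 GPUs for the MoE example, since 234 GB of weights plus about 5 GB of activations slightly exceeds what three 80 GB cards can hold after CUDA overhead.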
### Pipeline parallel benchmarks
We benchmarked the full abliteration pipeline across varying numbers of A100-80GB GPUs on two large models.
@@ -553,9 +596,11 @@ The remote runner:
 | Scenario | Recommendation |
 |----------|---------------|
 | Model fits on 1 GPU | Use 1 GPU. Adding more won't help and may slow things down. |
+| Model almost fits on 1 GPU | Try `--quantization bitsandbytes-8bit` or `bitsandbytes-4bit` to reduce memory. Halving precision roughly halves VRAM. |
 | Model fits on 1 GPU, PROBE is slow (many prompts) | Try `data-parallel-prereplication` branch. Only helps if model fits on each GPU with room for activations. |
-| Model doesn't fit on 1 GPU | Use `--gpus` with the **minimum** number of GPUs that fits. E.g., a 70B model in fp16 (~140 GB) needs 2x A100-80GB; don't use 4. |
-| Model needs 4+ GPUs | Pipeline parallel via `device_map="auto"` is the only option. Expect I/O-dominated runtimes for very large models. |
+| Model doesn't fit on 1 GPU | Use `--gpus` with the **minimum** number of GPUs that fits. Run `obliteratus gpu-calc` to find that number. |
+| Model needs 4+ GPUs | Pipeline parallel via `device_map="auto"` is the only option. Expect I/O-dominated runtimes for very large models. Consider quantization first: int4 can cut the GPU count by 4x. |
+| Not sure how many GPUs you need | Run `obliteratus gpu-calc <model> --gpu-mem <your_vram>` for an estimate. |
 | No local GPUs | Use `--remote user@gpu-node` to run on a remote machine, or use HuggingFace Spaces / Colab. |
 
 ## 10 study presets
+240 -1
@@ -266,12 +266,42 @@ def main(argv: list[str] | None = None):
         help="Also show global cross-architecture insights",
     )
 
+    # --- gpu-calc ---
+    calc_parser = subparsers.add_parser(
+        "gpu-calc",
+        help="Estimate minimum GPUs needed for a model",
+    )
+    calc_parser.add_argument(
+        "model", type=str, nargs="?", default=None,
+        help="HuggingFace model name/path (auto-fetches param counts)",
+    )
+    calc_parser.add_argument(
+        "--params", type=float, default=None, metavar="B",
+        help="Total parameters in billions (overrides auto-detection)",
+    )
+    calc_parser.add_argument(
+        "--active-params", type=float, default=None, metavar="B",
+        help="Active parameters in billions (for MoE models; defaults to --params)",
+    )
+    calc_parser.add_argument(
+        "--dtype", type=str, default="bfloat16",
+        choices=["float32", "float16", "bfloat16", "int8", "int4"],
+        help="Data type for model weights (default: bfloat16)",
+    )
+    calc_parser.add_argument(
+        "--gpu-mem", type=float, default=80.0, metavar="GB",
+        help="VRAM per GPU in GB (default: 80 for A100-80GB)",
+    )
+
     args = parser.parse_args(argv)
 
     # Apply GPU selection early (before any CUDA init)
     _apply_gpu_selection(args)
 
-    if args.command == "run":
+    if args.command == "gpu-calc":
+        _cmd_gpu_calc(args)
+        return
+    elif args.command == "run":
         if getattr(args, "remote", None):
             _cmd_remote_run(args)
         else:
@@ -765,6 +795,215 @@ def _cmd_abliterate(args):
)
def _cmd_gpu_calc(args):
    import math

    from rich.panel import Panel
    from rich.table import Table

    BYTES_PER_PARAM = {
        "float32": 4,
        "float16": 2,
        "bfloat16": 2,
        "int8": 1,
        "int4": 0.5,
    }

    # Resolve param counts
    total_params_b = args.params
    active_params_b = args.active_params

    if total_params_b is None:
        if args.model is None:
            console.print("[red]Provide either a model name or --params.[/]")
            raise SystemExit(1)

        console.print(f"Fetching config for [cyan]{args.model}[/]...")
        try:
            from transformers import AutoConfig
            config = AutoConfig.from_pretrained(args.model, trust_remote_code=True)
        except Exception as e:
            console.print(f"[red]Could not load config: {e}[/]")
            raise SystemExit(1)

        # Total params: prefer explicit num_parameters, else estimate from config
        total_params_b = _estimate_total_params_b(config)

        # Active params for MoE
        if active_params_b is None:
            active_params_b = _estimate_active_params_b(config, total_params_b)

    if active_params_b is None:
        active_params_b = total_params_b

    bpp = BYTES_PER_PARAM[args.dtype]
    gpu_mem_gb = args.gpu_mem

    # Model weight memory (use base-10 GB to match HF/nvidia conventions)
    weight_gb = total_params_b * bpp

    # Activation overhead during forward passes (PROBE/VERIFY).
    # Scales with active params, not total. Empirical from benchmarks:
    # - DeepSeek-70B (149GB): failed at 160GB (2 GPUs), OK at 240GB (3 GPUs)
    # - GPT-OSS-120B (234GB): failed at 240GB (3 GPUs), OK at 320GB (4 GPUs)
    # This implies ~15-35% overhead. We use 20% as a reasonable middle ground.
    active_weight_gb = active_params_b * bpp
    activation_overhead_gb = active_weight_gb * 0.20

    # CUDA context + fragmentation overhead: ~1.5 GB per GPU (fixed cost)
    cuda_overhead_per_gpu = 1.5

    # Total memory needed (before splitting across GPUs)
    total_needed_gb = weight_gb + activation_overhead_gb

    # Find minimum GPUs: we need total_needed / (gpu_mem - cuda_overhead) GPUs
    usable_per_gpu = gpu_mem_gb - cuda_overhead_per_gpu
    if usable_per_gpu <= 0:
        console.print("[red]GPU memory too small after CUDA overhead.[/]")
        raise SystemExit(1)

    min_gpus = math.ceil(total_needed_gb / usable_per_gpu)
    min_gpus = max(min_gpus, 1)

    # Show results for a range of GPU counts
    is_moe = active_params_b < total_params_b * 0.99

    table = Table(title="GPU Configurations", show_edge=True)
    table.add_column("GPUs", justify="right", style="cyan")
    table.add_column("VRAM/GPU", justify="right")
    table.add_column("Total VRAM", justify="right")
    table.add_column("Headroom", justify="right")
    table.add_column("Verdict", min_width=20)

    # Show from min_gpus-1 (to show why it fails) up to 8
    low = max(1, min_gpus - 1)
    high = max(min_gpus + 3, 8)

    for n in range(low, high + 1):
        total_vram = n * gpu_mem_gb
        usable_vram = n * usable_per_gpu
        headroom = usable_vram - total_needed_gb
        headroom_pct = headroom / total_needed_gb * 100
        vram_per = total_needed_gb / n

        if headroom < 0:
            verdict = "[red]INSUFFICIENT[/]"
        elif headroom_pct < 15:
            verdict = "[yellow]TIGHT — may fail[/]"
        elif n == min_gpus:
            verdict = "[bold green]MINIMUM (recommended)[/]"
        else:
            verdict = "[green]OK[/] [dim](more GPUs = slower)[/]"

        table.add_row(
            str(n),
            f"{vram_per:.1f} GB",
            f"{total_vram:.0f} GB",
            f"{headroom:+.1f} GB ({headroom_pct:+.0f}%)",
            verdict,
        )

    model_label = args.model or f"{total_params_b:.1f}B params"
    moe_line = ""
    if is_moe:
        moe_line = f"\n  Active params:  [cyan]{active_params_b:.1f}B[/] ({active_params_b/total_params_b*100:.0f}% of total — MoE)"

    console.print(Panel(
        f"  Model:          [cyan]{model_label}[/]\n"
        f"  Total params:   [cyan]{total_params_b:.1f}B[/]"
        f"{moe_line}\n"
        f"  Dtype:          [cyan]{args.dtype}[/] ({bpp} bytes/param)\n"
        f"  Weight memory:  [cyan]{weight_gb:.1f} GB[/]\n"
        f"  Activation est: [cyan]{activation_overhead_gb:.1f} GB[/]\n"
        f"  Total needed:   [bold]{total_needed_gb:.1f} GB[/]\n"
        f"  GPU VRAM:       [cyan]{gpu_mem_gb:.0f} GB[/] per device",
        title="[bold]GPU Calculator[/]",
        border_style="cyan",
    ))

    console.print(table)

    console.print(
        f"\n  [bold green]Minimum GPUs: {min_gpus}[/]"
        f"  ({min_gpus} x {gpu_mem_gb:.0f} GB = {min_gpus * gpu_mem_gb:.0f} GB)\n"
    )

    console.print(
        "[dim]Note: fewer GPUs = faster (pipeline parallel has cross-device overhead).\n"
        "Estimates are conservative. Actual memory may vary with sequence length\n"
        "and model architecture. See 'obliteratus obliterate --help' for runtime options.[/]\n"
    )


def _estimate_total_params_b(config) -> float:
    """Estimate total parameter count in billions from a HuggingFace config."""
    # Some configs have explicit param counts
    for attr in ("num_parameters", "n_params"):
        val = getattr(config, attr, None)
        if val and val > 1000:
            return val / 1e9

    # Estimate from architecture dimensions
    h = getattr(config, "hidden_size", 0)
    L = getattr(config, "num_hidden_layers", 0)
    V = getattr(config, "vocab_size", 0)
    i = getattr(config, "intermediate_size", h * 4)

    if h == 0 or L == 0:
        console.print("[red]Cannot determine model size from config. Use --params.[/]")
        raise SystemExit(1)

    n_heads = getattr(config, "num_attention_heads", None) or (h // 128)
    head_dim = getattr(config, "head_dim", None) or (h // n_heads if n_heads else 128)
    kv_heads = getattr(config, "num_key_value_heads", None) or n_heads

    # Attention: Q + K + V projections + output projection
    attn_params = h * (n_heads * head_dim) + h * (kv_heads * head_dim) * 2 + (n_heads * head_dim) * h

    # FFN (MoE or dense)
    n_experts = getattr(config, "num_local_experts", getattr(config, "num_experts", 1)) or 1
    # MoE models often have a separate intermediate size for expert FFNs
    moe_i = getattr(config, "moe_intermediate_size", i)
    # gate + up + down projections per expert
    ffn_per_expert = h * moe_i * 3
    ffn_params = ffn_per_expert * n_experts
    # Some architectures (Qwen, DeepSeek) also have a shared/dense FFN per layer
    if n_experts > 1 and hasattr(config, "moe_intermediate_size"):
        # The dense FFN uses the main intermediate_size
        ffn_params += h * i * 3
    # Router
    if n_experts > 1:
        ffn_params += h * n_experts

    # Per-layer: attention + FFN + layernorms
    layer_params = attn_params + ffn_params + h * 4  # 2 layernorms, 2 params each

    # Embedding + LM head
    embed_params = V * h * 2  # input + output embeddings (may be tied but counts for memory)

    total = L * layer_params + embed_params
    return total / 1e9


def _estimate_active_params_b(config, total_params_b: float) -> float:
    """For MoE models, estimate active parameters per forward pass."""
    n_experts = getattr(config, "num_local_experts", getattr(config, "num_experts", 1)) or 1
    if n_experts <= 1:
        return total_params_b

    top_k = getattr(config, "num_experts_per_tok", getattr(config, "top_k", 2)) or 2
    h = getattr(config, "hidden_size", 0)
    i = getattr(config, "intermediate_size", h * 4)
    moe_i = getattr(config, "moe_intermediate_size", i)
    L = getattr(config, "num_hidden_layers", 0)

    # FFN per expert (uses moe_intermediate_size if available)
    ffn_per_expert = h * moe_i * 3
    # Active FFN = top_k experts instead of all n_experts
    ffn_all = ffn_per_expert * n_experts * L
    ffn_active = ffn_per_expert * top_k * L

    # Non-FFN params (includes any shared/dense FFN)
    non_ffn = total_params_b * 1e9 - ffn_all
    active = non_ffn + ffn_active
    return max(active / 1e9, 0.1)


def _make_remote_runner(args):
    """Create a RemoteRunner from CLI --remote flags."""
    from obliteratus.remote import RemoteConfig, RemoteRunner