diff --git a/README.md b/README.md index 045ae5f..fa6654e 100644 --- a/README.md +++ b/README.md @@ -415,6 +415,194 @@ Includes pre-liberated variants (Dolphin, Hermes, WhiteRabbitNeo) for A/B compar obliteratus models ``` +## Multi-GPU and remote execution + +OBLITERATUS automatically shards models across multiple GPUs when they don't fit on a single card. It also supports remote execution over SSH, so you can run the pipeline on a GPU server from your laptop. + +### How model sharding works + +When you have multiple GPUs, OBLITERATUS uses accelerate's `device_map="auto"` to split the model's layers across all available GPUs. This is **naive pipeline parallelism** — layers are distributed evenly, but only one GPU computes at a time as activations flow sequentially through the layer stack. The other GPUs hold their assigned layers in memory but are idle until their turn. + +This means multi-GPU sharding is a **memory solution, not a speed solution**. It lets you run models that don't fit on one GPU, but it won't make small models run faster. In fact, more GPUs can be *slower* due to inter-GPU data transfer overhead at layer boundaries. + +### Selecting GPUs + +Use `--gpus` to control which GPUs are used: + +```bash +# Use all 8 GPUs (default) +obliteratus obliterate bigmodel/200B --gpus all + +# Use only GPUs 0-3 +obliteratus obliterate bigmodel/200B --gpus 0,1,2,3 + +# Use a specific pair +obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct --gpus 2,5 +``` + +This sets `CUDA_VISIBLE_DEVICES` before CUDA initializes. The model is then sharded across the selected GPUs. + +### Precision and quantization + +The `--dtype` flag controls the precision of model weights, which directly determines how much VRAM you need. 
Lower precision means smaller memory footprint at the cost of some numerical fidelity: + +| Dtype | Bytes/param | 7B model | 70B model | 405B model | +|-------|-----------|---------|----------|-----------| +| `float32` | 4 | 28 GB | 280 GB | 1620 GB | +| `float16` / `bfloat16` | 2 | 14 GB | 140 GB | 810 GB | +| `int8` (via `--quantization bitsandbytes-8bit`) | 1 | 7 GB | 70 GB | 405 GB | +| `int4` (via `--quantization bitsandbytes-4bit`) | 0.5 | 3.5 GB | 35 GB | 203 GB | + +```bash +# Default: bfloat16 +obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct + +# 8-bit quantization — fits on fewer GPUs +obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \ + --quantization bitsandbytes-8bit + +# 4-bit quantization — Llama-405B on 4x A100-80GB +obliteratus obliterate meta-llama/Llama-3.1-405B-Instruct \ + --quantization bitsandbytes-4bit --dtype float16 +``` + +Quantization roughly halves the GPU count at each step down. A 70B model that needs 3x A100-80GB in bf16 fits on 2 in int8 or 1 in int4. + +### GPU calculator + +Not sure how many GPUs you need? The `gpu-calc` command estimates the minimum GPU count for any model, accounting for weight memory, activation overhead, and CUDA context: + +```bash +# Auto-detect from HuggingFace model name +obliteratus gpu-calc meta-llama/Llama-3.1-70B-Instruct --gpu-mem 24 + +# Manual: specify params and precision +obliteratus gpu-calc --params 70 --dtype bfloat16 --gpu-mem 80 + +# MoE models: specify active params separately +obliteratus gpu-calc --params 117 --active-params 13 --dtype bfloat16 --gpu-mem 80 +``` + +The calculator fetches the model config from HuggingFace to estimate parameter counts (including MoE expert structure), then shows a table of GPU configurations with headroom estimates. For MoE models, activation overhead is computed from the active parameter count rather than total parameters. 
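The arithmetic behind such an estimate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's interface: the function name `estimate_min_gpus` and its parameters are hypothetical, while the 20% activation overhead and the ~1.5 GB per-GPU CUDA overhead mirror the constants used by `_cmd_gpu_calc` in this change.

```python
import math

# Bytes per parameter for each dtype (same values as the precision table above).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "int4": 0.5}


def estimate_min_gpus(params_b, dtype="bfloat16", gpu_mem_gb=80.0,
                      active_params_b=None, activation_frac=0.20,
                      cuda_overhead_gb=1.5):
    """Hypothetical helper: estimate the minimum GPU count for a model.

    params_b and active_params_b are parameter counts in billions.
    """
    if active_params_b is None:
        active_params_b = params_b  # dense model: every parameter is active
    bpp = BYTES_PER_PARAM[dtype]
    weight_gb = params_b * bpp
    # Activation overhead scales with the active parameters, not the total.
    activation_gb = active_params_b * bpp * activation_frac
    # Each GPU loses a fixed slice to CUDA context and fragmentation.
    usable_per_gpu = gpu_mem_gb - cuda_overhead_gb
    if usable_per_gpu <= 0:
        raise ValueError("GPU memory too small after CUDA overhead")
    return max(1, math.ceil((weight_gb + activation_gb) / usable_per_gpu))


print(estimate_min_gpus(70))                       # 70B dense, bf16, 80 GB cards -> 3
print(estimate_min_gpus(117, active_params_b=13))  # 117B MoE, 13B active -> 4
```

Under these assumptions a 70B dense model in bf16 needs 140 GB of weights plus 28 GB of activation overhead, which rounds up to three 80 GB cards; the MoE case needs 234 GB plus only 5.2 GB, which rounds up to four.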
+ +### Pipeline parallel benchmarks + +We benchmarked the full abliteration pipeline across varying numbers of A100-80GB GPUs on two large models. + +**GPT-OSS-120B** (117B MoE, ~234 GB in bf16): + +| GPUs | Total time | VRAM/GPU | Notes | +|------|-----------|----------|-------| +| 3 | **FAILED** | ~78 GB | Not enough headroom for activations; some layers offloaded to CPU as meta tensors, crashes during EXCISE | +| 4 | **615s** (10m15s) | ~58 GB | Fastest. Fewest inter-GPU transfers. Snapshot auto-skipped (insufficient free VRAM) | +| 5 | 763s (12m43s) | ~47 GB | +24% slower than 4 GPUs | +| 6 | 766s (12m46s) | ~39 GB | +25% slower than 4 GPUs | +| 8 | 633s (10m33s) | ~29 GB | +3% slower than 4 GPUs. Ran CPU-side state dict snapshot (adds ~20s) | + +**DeepSeek-R1-Distill-Llama-70B** (70B dense, ~149 GB in bf16, 80 layers): + +| GPUs | Total time | VRAM/GPU | Notes | +|------|-----------|----------|-------| +| 2 | **FAILED** | ~75 GB | Meta tensor crash — 149 GB model on 160 GB total VRAM leaves no activation headroom | +| 3 | **536s** (8m56s) | ~50 GB | Fastest. Minimum viable GPU count for this model | +| 4 | 626s (10m26s) | ~37 GB | +17% slower than 3 GPUs | +| 8 | 627s (10m27s) | ~19 GB | +17% slower than 3 GPUs. No benefit over 4 | + +Stage breakdown (approximately constant across GPU counts): + +| Stage | GPT-OSS-120B | DeepSeek-70B | Bottleneck | +|-------|-------------|-------------|-----------| +| SUMMON (load) | ~11s | ~24s | Disk I/O (model cached locally) | +| PROBE (activations) | ~20s | ~20s | Forward passes through sharded model | +| DISTILL + EXCISE | ~30s | ~30s | SVD + weight projection (CPU-bound) | +| VERIFY | ~210s | ~270s | Forward passes on validation prompts | +| REBIRTH (save) | ~350s | ~194s | Writing model to disk (234 GB vs 141 GB) | + +Key findings: + +- **Use the minimum number of GPUs that fits your model.** Extra GPUs only add cross-device transfer overhead. 
4 GPUs were faster than 8 for GPT-OSS-120B, and 3 GPUs were fastest for DeepSeek-70B.
+- **VERIFY and REBIRTH dominate wall time for large models.** Together, forward passes on validation prompts (VERIFY) and writing the model to disk (REBIRTH) account for ~90% of wall time. The abliteration stages themselves (PROBE, DISTILL, EXCISE) are fast regardless of GPU count.
+- **Leave headroom.** The model needs VRAM beyond its parameter storage: activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model; 2x A100-80GB (160 GB) was not enough for a 149 GB model.
+- **Pipeline parallelism doesn't speed up compute-bound stages.** Since only one GPU computes at a time, doubling GPUs doesn't halve PROBE or VERIFY time. It only enables fitting larger models.
+
+### When you actually need data parallelism
+
+For models that fit on a single GPU with room to spare, the PROBE stage (which runs 1024 forward passes to collect activations) is the main computational bottleneck. Pipeline parallelism doesn't help here, because it still processes one prompt at a time through the full layer stack.
+
+True data parallelism (replicating the model and splitting prompts across GPUs) can speed up PROBE, but it requires enough VRAM to hold a full copy of the model on each GPU. An experimental pre-replicated data parallel implementation is available on the `data-parallel-prereplication` branch:
+
+```bash
+git checkout data-parallel-prereplication
+obliteratus obliterate EleutherAI/pythia-12b --data-parallel
+```
+
+This deep-copies the model to each GPU once, then distributes prompt batches across replicas using a thread pool.
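The replicate-once, fan-out pattern can be sketched in plain Python. This is not the branch's implementation: each replica here is an ordinary callable rather than a model on a CUDA device, `copy.deepcopy` stands in for the per-GPU copy, and the names `run_data_parallel` and `ToyModel` are hypothetical.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


class ToyModel:
    """Stand-in for a model: maps a prompt to a number."""

    def __call__(self, prompt):
        return len(prompt)


def run_data_parallel(model, prompts, n_replicas):
    """Hypothetical sketch: copy `model` once per replica, then fan prompts out."""
    # Replicate once up front (stand-in for deep-copying the model to each GPU).
    replicas = [copy.deepcopy(model) for _ in range(n_replicas)]
    # Round-robin split: replica r handles prompts r, r + n, r + 2n, ...
    shards = [prompts[r::n_replicas] for r in range(n_replicas)]

    def worker(r):
        return [replicas[r](p) for p in shards[r]]

    # One thread per replica; each thread only touches its own copy.
    with ThreadPoolExecutor(max_workers=n_replicas) as pool:
        shard_outputs = list(pool.map(worker, range(n_replicas)))

    # Interleave the shard results back into the original prompt order.
    outputs = [None] * len(prompts)
    for r, outs in enumerate(shard_outputs):
        outputs[r::n_replicas] = outs
    return outputs


print(run_data_parallel(ToyModel(), ["a", "bb", "ccc", "dddd", "eeeee"], 2))
# [1, 2, 3, 4, 5]
```

The copies are made before any work is dispatched, so the replication cost is paid once per run rather than once per batch. Whether that cost is recovered depends on how much PROBE work there is to split.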
Benchmarks on Pythia 12B (24 GB model, 8x A100-80GB): + +| Mode | PROBE time | Notes | +|------|-----------|-------| +| Single GPU | 7.1s | Baseline | +| Pre-replicated DP (8 GPUs) | 7.7s | Near parity — PROBE is too fast at this scale for parallelism to help | + +Data parallelism becomes more valuable as the prompt count or model size increases relative to the per-forward-pass cost. For most models, the overhead of replication exceeds the time saved. + +### Remote execution over SSH + +Run the full pipeline on a remote GPU node from your local machine. OBLITERATUS handles SSH connection, auto-installs itself on the remote if needed, streams logs in real time, and copies results back when done. + +```bash +# Basic remote run +obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \ + --remote user@gpu-node + +# With SSH key and custom options +obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \ + --remote root@10.0.0.5 \ + --ssh-key ~/.ssh/id_rsa \ + --ssh-port 2222 \ + --remote-dir /data/obliteratus \ + --remote-python python3.11 + +# Don't copy results back (keep on remote only) +obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \ + --remote user@gpu-node --no-sync +``` + +Remote execution also works with `obliteratus run` (YAML configs) and `obliteratus tourney` (method comparison). You can specify remote settings in YAML: + +```yaml +model: + name: meta-llama/Llama-3.1-70B-Instruct + dtype: float16 + +remote: + host: gpu-node + user: root + ssh_key: ~/.ssh/id_rsa + remote_dir: /tmp/obliteratus_run + gpus: "0,1,2,3" # select GPUs on the remote + sync_results: true # copy results back when done +``` + +The remote runner: +1. Tests SSH connectivity +2. Detects GPUs on the remote (`nvidia-smi`) +3. Installs obliteratus if not already present +4. Uploads config files if using `obliteratus run` +5. Runs the pipeline with real-time log streaming +6. 
Copies results back via SCP + +### Choosing the right setup + +| Scenario | Recommendation | +|----------|---------------| +| Model fits on 1 GPU | Use 1 GPU. Adding more won't help and may slow things down. | +| Model almost fits on 1 GPU | Try `--quantization bitsandbytes-8bit` or `bitsandbytes-4bit` to reduce memory. Halving precision roughly halves VRAM. | +| Model fits on 1 GPU, PROBE is slow (many prompts) | Try `data-parallel-prereplication` branch. Only helps if model fits on each GPU with room for activations. | +| Model doesn't fit on 1 GPU | Use `--gpus` with the **minimum** number of GPUs that fits. Run `obliteratus gpu-calc` to find that number. | +| Model needs 4+ GPUs | Pipeline parallel via `device_map="auto"` is the only option. Expect I/O-dominated runtimes for very large models. Consider quantization first — int4 can cut the GPU count by 4x. | +| Not sure how many GPUs you need | Run `obliteratus gpu-calc --gpu-mem ` for an estimate. | +| No local GPUs | Use `--remote user@gpu-node` to run on a remote machine, or use HuggingFace Spaces / Colab. | + ## 10 study presets Pre-configured ablation studies you can run out of the box: diff --git a/examples/remote_gpu_node.yaml b/examples/remote_gpu_node.yaml new file mode 100644 index 0000000..6d6951f --- /dev/null +++ b/examples/remote_gpu_node.yaml @@ -0,0 +1,47 @@ +# Example: Run an ablation study on a remote GPU node via SSH. +# +# Usage: +# obliteratus run examples/remote_gpu_node.yaml +# +# The 'remote' section tells Obliteratus to SSH into the specified host, +# install obliteratus if needed, run the pipeline there, and copy results +# back to the local machine. +# +# You can also use --remote on any command instead of a YAML section: +# obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --remote root@gpu-node --ssh-key ~/.ssh/id_rsa +# +# Multi-GPU: Models are automatically split across all available GPUs via +# accelerate's device_map="auto". 
Use --gpus or the gpus: field to select +# specific GPUs: +# obliteratus obliterate model --remote root@gpu-node --gpus 0,1,2,3 + +model: + name: meta-llama/Llama-3.1-8B-Instruct + task: causal_lm + dtype: float16 + device: auto + +dataset: + name: wikitext + split: test + max_samples: 500 + +strategies: + - name: layer_removal + params: + layer_indices: [10, 11, 12] + +metrics: [perplexity] +batch_size: 8 +max_length: 512 +output_dir: results/remote_example + +remote: + host: gpu-node.example.com + user: root + port: 22 + ssh_key: ~/.ssh/id_rsa + remote_dir: /tmp/obliteratus_run + python: python3 + sync_results: true + # gpus: "0,1,2,3" # uncomment to select specific GPUs (default: all) diff --git a/obliteratus/__init__.py b/obliteratus/__init__.py index 7f70058..2039325 100644 --- a/obliteratus/__init__.py +++ b/obliteratus/__init__.py @@ -17,6 +17,8 @@ __all__ = [ "TourneyResult", "get_adaptive_recommendation", "AdaptiveRecommendation", + "RemoteRunner", + "RemoteConfig", ] @@ -60,4 +62,10 @@ def __getattr__(name): if name == "AdaptiveRecommendation": from obliteratus.adaptive_defaults import AdaptiveRecommendation return AdaptiveRecommendation + if name == "RemoteRunner": + from obliteratus.remote import RemoteRunner + return RemoteRunner + if name == "RemoteConfig": + from obliteratus.remote import RemoteConfig + return RemoteConfig raise AttributeError(f"module {__name__!r} has no attribute {name!r}") diff --git a/obliteratus/__main__.py b/obliteratus/__main__.py new file mode 100644 index 0000000..995f990 --- /dev/null +++ b/obliteratus/__main__.py @@ -0,0 +1,5 @@ +"""Allow running obliteratus as ``python -m obliteratus``.""" + +from obliteratus.cli import main + +main() diff --git a/obliteratus/cli.py b/obliteratus/cli.py index e32fb25..b0767ab 100644 --- a/obliteratus/cli.py +++ b/obliteratus/cli.py @@ -22,6 +22,70 @@ _BANNER = r""" """ +def _add_gpu_args(parser): + """Add --gpus flag for multi-GPU control.""" + gpu_group = parser.add_argument_group("GPU 
selection") + gpu_group.add_argument( + "--gpus", type=str, default=None, metavar="IDS", + help=( + "Comma-separated GPU IDs to use (e.g. '0,1,2,3' or 'all'). " + "Sets CUDA_VISIBLE_DEVICES. By default uses all available GPUs. " + "Models are automatically split across selected GPUs via accelerate." + ), + ) + + +def _add_remote_args(parser): + """Add --remote execution flags to a subcommand parser.""" + remote_group = parser.add_argument_group("remote execution") + remote_group.add_argument( + "--remote", type=str, default=None, metavar="[USER@]HOST", + help="Run on a remote GPU node via SSH (e.g. root@gpu-node or just gpu-node)", + ) + remote_group.add_argument( + "--ssh-key", type=str, default=None, + help="Path to SSH private key (default: use SSH agent or ~/.ssh/id_rsa)", + ) + remote_group.add_argument( + "--ssh-port", type=int, default=22, + help="SSH port on remote host (default: 22)", + ) + remote_group.add_argument( + "--remote-dir", type=str, default="/tmp/obliteratus_run", + help="Working directory on the remote machine (default: /tmp/obliteratus_run)", + ) + remote_group.add_argument( + "--remote-python", type=str, default="python3", + help="Python binary on the remote machine (default: python3)", + ) + remote_group.add_argument( + "--no-sync", action="store_true", default=False, + help="Don't copy results back to local machine after remote run", + ) + + +def _apply_gpu_selection(args): + """Set CUDA_VISIBLE_DEVICES based on --gpus flag (for local runs only).""" + import os + + gpus = getattr(args, "gpus", None) + if gpus is None or getattr(args, "remote", None): + return # skip for remote runs (handled by remote runner) + + if gpus.lower() == "all": + return # use all GPUs (default behavior) + + # Validate: should be comma-separated integers + try: + gpu_ids = [int(g.strip()) for g in gpus.split(",")] + except ValueError: + console.print(f"[red]Invalid --gpus value: {gpus!r}. 
Expected comma-separated integers or 'all'.[/]") + raise SystemExit(1) + + os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids) + console.print(f"[dim]Using GPUs: {gpu_ids} (CUDA_VISIBLE_DEVICES={os.environ['CUDA_VISIBLE_DEVICES']})[/dim]") + + def main(argv: list[str] | None = None): console.print(_BANNER) parser = argparse.ArgumentParser( @@ -40,6 +104,8 @@ def main(argv: list[str] | None = None): default=None, help="Apply a preset (e.g. quick, full, attention, jailbreak, guardrail)", ) + _add_gpu_args(run_parser) + _add_remote_args(run_parser) # --- info --- info_parser = subparsers.add_parser("info", help="Print model architecture info") @@ -144,9 +210,13 @@ def main(argv: list[str] | None = None): help="One-click: remove refusal directions from a model (SOTA multi-technique)", ) _add_obliterate_args(abl_parser) + _add_gpu_args(abl_parser) + _add_remote_args(abl_parser) # Backward-compat alias (hidden from help) abl_alias = subparsers.add_parser("abliterate", help=argparse.SUPPRESS) _add_obliterate_args(abl_alias) + _add_gpu_args(abl_alias) + _add_remote_args(abl_alias) # --- report --- report_parser = subparsers.add_parser("report", help="Regenerate report from saved results") @@ -180,6 +250,8 @@ def main(argv: list[str] | None = None): "--methods", type=str, nargs="+", default=None, help="Override: only run these methods (space-separated)", ) + _add_gpu_args(tourney_parser) + _add_remote_args(tourney_parser) # --- recommend --- recommend_parser = subparsers.add_parser( @@ -194,10 +266,46 @@ def main(argv: list[str] | None = None): help="Also show global cross-architecture insights", ) + # --- gpu-calc --- + calc_parser = subparsers.add_parser( + "gpu-calc", + help="Estimate minimum GPUs needed for a model", + ) + calc_parser.add_argument( + "model", type=str, nargs="?", default=None, + help="HuggingFace model name/path (auto-fetches param counts)", + ) + calc_parser.add_argument( + "--params", type=float, default=None, metavar="B", + 
help="Total parameters in billions (overrides auto-detection)", + ) + calc_parser.add_argument( + "--active-params", type=float, default=None, metavar="B", + help="Active parameters in billions (for MoE models; defaults to --params)", + ) + calc_parser.add_argument( + "--dtype", type=str, default="bfloat16", + choices=["float32", "float16", "bfloat16", "int8", "int4"], + help="Data type for model weights (default: bfloat16)", + ) + calc_parser.add_argument( + "--gpu-mem", type=float, default=80.0, metavar="GB", + help="VRAM per GPU in GB (default: 80 for A100-80GB)", + ) + args = parser.parse_args(argv) - if args.command == "run": - _cmd_run(args) + # Apply GPU selection early (before any CUDA init) + _apply_gpu_selection(args) + + if args.command == "gpu-calc": + _cmd_gpu_calc(args) + return + elif args.command == "run": + if getattr(args, "remote", None): + _cmd_remote_run(args) + else: + _cmd_run(args) elif args.command == "interactive": _cmd_interactive() elif args.command == "models": @@ -217,9 +325,15 @@ def main(argv: list[str] | None = None): elif args.command == "recommend": _cmd_recommend(args) elif args.command == "tourney": - _cmd_tourney(args) + if getattr(args, "remote", None): + _cmd_remote_tourney(args) + else: + _cmd_tourney(args) elif args.command in ("obliterate", "abliterate"): - _cmd_abliterate(args) + if getattr(args, "remote", None): + _cmd_remote_abliterate(args) + else: + _cmd_abliterate(args) def _cmd_ui(args): @@ -314,6 +428,34 @@ def _cmd_run(args): config = StudyConfig.from_dict(raw) if args.output_dir: config.output_dir = args.output_dir + + # If YAML has a remote: section, dispatch to remote runner + if config.remote is not None: + from obliteratus.remote import RemoteConfig as _RC, RemoteRunner + + rc = _RC( + host=config.remote.host, + user=config.remote.user, + port=config.remote.port, + ssh_key=config.remote.ssh_key, + remote_dir=config.remote.remote_dir, + python=config.remote.python, + sync_results=config.remote.sync_results, + 
gpus=config.remote.gpus, + ) + runner = RemoteRunner(rc) + result_path = runner.run_config( + local_config_path=args.config, + local_output_dir=config.output_dir, + preset=args.preset, + ) + if result_path: + console.print(f"\n[bold green]Remote run complete.[/] Results at: [cyan]{result_path}[/]") + else: + console.print("[red]Remote run failed. Check logs above.[/]") + raise SystemExit(1) + return + run_study(config) @@ -653,5 +795,321 @@ def _cmd_abliterate(args): ) +def _cmd_gpu_calc(args): + import math + + from rich.panel import Panel + from rich.table import Table + + BYTES_PER_PARAM = { + "float32": 4, + "float16": 2, + "bfloat16": 2, + "int8": 1, + "int4": 0.5, + } + + # Resolve param counts + total_params_b = args.params + active_params_b = args.active_params + + if total_params_b is None: + if args.model is None: + console.print("[red]Provide either a model name or --params.[/]") + raise SystemExit(1) + console.print(f"Fetching config for [cyan]{args.model}[/]...") + try: + from transformers import AutoConfig + config = AutoConfig.from_pretrained(args.model, trust_remote_code=True) + except Exception as e: + console.print(f"[red]Could not load config: {e}[/]") + raise SystemExit(1) + + # Total params: prefer explicit num_parameters, else estimate from config + total_params_b = _estimate_total_params_b(config) + + # Active params for MoE + if active_params_b is None: + active_params_b = _estimate_active_params_b(config, total_params_b) + + if active_params_b is None: + active_params_b = total_params_b + + bpp = BYTES_PER_PARAM[args.dtype] + gpu_mem_gb = args.gpu_mem + + # Model weight memory (use base-10 GB to match HF/nvidia conventions) + weight_gb = total_params_b * bpp + + # Activation overhead during forward passes (PROBE/VERIFY). + # Scales with active params, not total. 
Empirical from benchmarks: + # - DeepSeek-70B (149GB): failed at 160GB (2 GPUs), OK at 240GB (3 GPUs) + # - GPT-OSS-120B (234GB): failed at 240GB (3 GPUs), OK at 320GB (4 GPUs) + # This implies ~15-35% overhead. We use 20% as a reasonable middle ground. + active_weight_gb = active_params_b * bpp + activation_overhead_gb = active_weight_gb * 0.20 + + # CUDA context + fragmentation overhead: ~1.5 GB per GPU (fixed cost) + cuda_overhead_per_gpu = 1.5 + + # Total memory needed (before splitting across GPUs) + total_needed_gb = weight_gb + activation_overhead_gb + + # Find minimum GPUs: we need total_needed / (gpu_mem - cuda_overhead) GPUs + usable_per_gpu = gpu_mem_gb - cuda_overhead_per_gpu + if usable_per_gpu <= 0: + console.print("[red]GPU memory too small after CUDA overhead.[/]") + raise SystemExit(1) + + min_gpus = math.ceil(total_needed_gb / usable_per_gpu) + min_gpus = max(min_gpus, 1) + + # Show results for a range of GPU counts + is_moe = active_params_b < total_params_b * 0.99 + + table = Table(title="GPU Configurations", show_edge=True) + table.add_column("GPUs", justify="right", style="cyan") + table.add_column("VRAM/GPU", justify="right") + table.add_column("Total VRAM", justify="right") + table.add_column("Headroom", justify="right") + table.add_column("Verdict", min_width=20) + + # Show from min_gpus-1 (to show why it fails) up to 8 + low = max(1, min_gpus - 1) + high = max(min_gpus + 3, 8) + for n in range(low, high + 1): + total_vram = n * gpu_mem_gb + usable_vram = n * usable_per_gpu + headroom = usable_vram - total_needed_gb + headroom_pct = headroom / total_needed_gb * 100 + vram_per = total_needed_gb / n + + if headroom < 0: + verdict = "[red]INSUFFICIENT[/]" + elif headroom_pct < 15: + verdict = "[yellow]TIGHT — may fail[/]" + elif n == min_gpus: + verdict = "[bold green]MINIMUM (recommended)[/]" + else: + verdict = "[green]OK[/] [dim](more GPUs = slower)[/]" + + table.add_row( + str(n), + f"{vram_per:.1f} GB", + f"{total_vram:.0f} GB", + 
f"{headroom:+.1f} GB ({headroom_pct:+.0f}%)", + verdict, + ) + + model_label = args.model or f"{total_params_b:.1f}B params" + moe_line = "" + if is_moe: + moe_line = f"\n Active params: [cyan]{active_params_b:.1f}B[/] ({active_params_b/total_params_b*100:.0f}% of total — MoE)" + + console.print(Panel( + f" Model: [cyan]{model_label}[/]\n" + f" Total params: [cyan]{total_params_b:.1f}B[/]" + f"{moe_line}\n" + f" Dtype: [cyan]{args.dtype}[/] ({bpp} bytes/param)\n" + f" Weight memory: [cyan]{weight_gb:.1f} GB[/]\n" + f" Activation est: [cyan]{activation_overhead_gb:.1f} GB[/]\n" + f" Total needed: [bold]{total_needed_gb:.1f} GB[/]\n" + f" GPU VRAM: [cyan]{gpu_mem_gb:.0f} GB[/] per device", + title="[bold]GPU Calculator[/]", + border_style="cyan", + )) + console.print(table) + console.print( + f"\n [bold green]Minimum GPUs: {min_gpus}[/]" + f" ({min_gpus} x {gpu_mem_gb:.0f} GB = {min_gpus * gpu_mem_gb:.0f} GB)\n" + ) + console.print( + "[dim]Note: fewer GPUs = faster (pipeline parallel has cross-device overhead).\n" + "Estimates are conservative. Actual memory may vary with sequence length\n" + "and model architecture. See 'obliteratus obliterate --help' for runtime options.[/]\n" + ) + + +def _estimate_total_params_b(config) -> float: + """Estimate total parameter count in billions from a HuggingFace config.""" + # Some configs have explicit param counts + for attr in ("num_parameters", "n_params"): + val = getattr(config, attr, None) + if val and val > 1000: + return val / 1e9 + + # Estimate from architecture dimensions + h = getattr(config, "hidden_size", 0) + L = getattr(config, "num_hidden_layers", 0) + V = getattr(config, "vocab_size", 0) + i = getattr(config, "intermediate_size", h * 4) + + if h == 0 or L == 0: + console.print("[red]Cannot determine model size from config. 
Use --params.[/]") + raise SystemExit(1) + + n_heads = getattr(config, "num_attention_heads", None) or (h // 128) + head_dim = getattr(config, "head_dim", None) or (h // n_heads if n_heads else 128) + kv_heads = getattr(config, "num_key_value_heads", None) or n_heads + + # Attention: Q + K + V projections + output projection + attn_params = h * (n_heads * head_dim) + h * (kv_heads * head_dim) * 2 + (n_heads * head_dim) * h + + # FFN (MoE or dense) + n_experts = getattr(config, "num_local_experts", getattr(config, "num_experts", 1)) or 1 + # MoE models often have a separate intermediate size for expert FFNs + moe_i = getattr(config, "moe_intermediate_size", i) + # gate + up + down projections per expert + ffn_per_expert = h * moe_i * 3 + ffn_params = ffn_per_expert * n_experts + # Some architectures (Qwen, DeepSeek) also have a shared/dense FFN per layer + if n_experts > 1 and hasattr(config, "moe_intermediate_size"): + # The dense FFN uses the main intermediate_size + ffn_params += h * i * 3 + # Router + if n_experts > 1: + ffn_params += h * n_experts + + # Per-layer: attention + FFN + layernorms + layer_params = attn_params + ffn_params + h * 4 # 2 layernorms, 2 params each + + # Embedding + LM head + embed_params = V * h * 2 # input + output embeddings (may be tied but counts for memory) + + total = L * layer_params + embed_params + return total / 1e9 + + +def _estimate_active_params_b(config, total_params_b: float) -> float: + """For MoE models, estimate active parameters per forward pass.""" + n_experts = getattr(config, "num_local_experts", getattr(config, "num_experts", 1)) or 1 + if n_experts <= 1: + return total_params_b + + top_k = getattr(config, "num_experts_per_tok", getattr(config, "top_k", 2)) or 2 + + h = getattr(config, "hidden_size", 0) + i = getattr(config, "intermediate_size", h * 4) + moe_i = getattr(config, "moe_intermediate_size", i) + L = getattr(config, "num_hidden_layers", 0) + + # FFN per expert (uses moe_intermediate_size if available) + 
ffn_per_expert = h * moe_i * 3 + # Active FFN = top_k experts instead of all n_experts + ffn_all = ffn_per_expert * n_experts * L + ffn_active = ffn_per_expert * top_k * L + # Non-FFN params (includes any shared/dense FFN) + non_ffn = total_params_b * 1e9 - ffn_all + active = non_ffn + ffn_active + return max(active / 1e9, 0.1) + + +def _make_remote_runner(args): + """Create a RemoteRunner from CLI --remote flags.""" + from obliteratus.remote import RemoteConfig, RemoteRunner + + rc = RemoteConfig.from_cli_args( + args.remote, + port=args.ssh_port, + ssh_key=args.ssh_key, + remote_dir=args.remote_dir, + python=args.remote_python, + sync_results=not args.no_sync, + gpus=getattr(args, "gpus", None), + ) + return RemoteRunner(rc) + + +def _cmd_remote_abliterate(args): + from rich.panel import Panel + + runner = _make_remote_runner(args) + + kwargs = {} + if args.method: + kwargs["method"] = args.method + if args.device: + kwargs["device"] = args.device + if args.dtype: + kwargs["dtype"] = args.dtype + if args.quantization: + kwargs["quantization"] = args.quantization + if args.n_directions is not None: + kwargs["n_directions"] = args.n_directions + if getattr(args, "direction_method", None): + kwargs["direction_method"] = args.direction_method + if args.regularization is not None: + kwargs["regularization"] = args.regularization + if args.refinement_passes is not None: + kwargs["refinement_passes"] = args.refinement_passes + if getattr(args, "large_model", False): + kwargs["large_model"] = True + if getattr(args, "verify_sample_size", None) is not None: + kwargs["verify_sample_size"] = args.verify_sample_size + + result_path = runner.run_obliterate( + model=args.model, + local_output_dir=args.output_dir, + **kwargs, + ) + + if result_path: + console.print( + Panel( + f"[bold green]Remote abliteration complete![/]\n\n" + f" Results at: [cyan]{result_path}[/]\n\n" + f" [dim]Load with:[/] AutoModelForCausalLM.from_pretrained('{result_path}')", + border_style="green", + 
title="[bold green]REBIRTH COMPLETE (remote)[/]", + ) + ) + else: + console.print("[red]Remote abliteration failed. Check logs above.[/]") + raise SystemExit(1) + + +def _cmd_remote_run(args): + runner = _make_remote_runner(args) + result_path = runner.run_config( + local_config_path=args.config, + local_output_dir=args.output_dir, + preset=args.preset, + ) + if result_path: + console.print(f"\n[bold green]Remote run complete.[/] Results at: [cyan]{result_path}[/]") + else: + console.print("[red]Remote run failed. Check logs above.[/]") + raise SystemExit(1) + + +def _cmd_remote_tourney(args): + from rich.panel import Panel + + runner = _make_remote_runner(args) + result_path = runner.run_tourney( + model=args.model, + local_output_dir=args.output_dir, + device=args.device, + dtype=args.dtype, + quantization=args.quantization, + methods=args.methods, + hub_org=args.hub_org, + hub_repo=args.hub_repo, + dataset=args.dataset, + ) + if result_path: + console.print( + Panel( + f"[bold green]Remote tournament complete![/]\n\n" + f" Results at: [cyan]{result_path}[/]", + border_style="green", + title="[bold green]TOURNAMENT COMPLETE (remote)[/]", + ) + ) + else: + console.print("[red]Remote tournament failed. 
Check logs above.[/]") + raise SystemExit(1) + + if __name__ == "__main__": main() diff --git a/obliteratus/config.py b/obliteratus/config.py index 9947803..e12d837 100644 --- a/obliteratus/config.py +++ b/obliteratus/config.py @@ -35,6 +35,20 @@ class StrategyConfig: params: dict[str, Any] = field(default_factory=dict) +@dataclass +class RemoteConfig: + """Optional remote execution settings for running on a GPU node via SSH.""" + + host: str + user: str = "root" + port: int = 22 + ssh_key: str | None = None + remote_dir: str = "/tmp/obliteratus_run" + python: str = "python3" + sync_results: bool = True + gpus: str | None = None # comma-separated GPU IDs or "all" + + @dataclass class StudyConfig: """Top-level configuration for an ablation run.""" @@ -46,6 +60,7 @@ class StudyConfig: batch_size: int = 8 max_length: int = 512 output_dir: str = "results" + remote: RemoteConfig | None = None @classmethod def from_yaml(cls, path: str | Path) -> StudyConfig: @@ -82,6 +97,10 @@ class StudyConfig: model = ModelConfig(**d["model"]) dataset = DatasetConfig(**d["dataset"]) strategies = [StrategyConfig(**s) for s in d["strategies"]] + remote = None + if "remote" in d and d["remote"]: + remote = RemoteConfig(**d["remote"]) + return cls( model=model, dataset=dataset, @@ -90,6 +109,7 @@ class StudyConfig: batch_size=d.get("batch_size", 8), max_length=d.get("max_length", 512), output_dir=d.get("output_dir", "results"), + remote=remote, ) def to_dict(self) -> dict: diff --git a/obliteratus/models/loader.py b/obliteratus/models/loader.py index 5e98115..ed7af53 100644 --- a/obliteratus/models/loader.py +++ b/obliteratus/models/loader.py @@ -312,14 +312,27 @@ class ModelHandle: ) def snapshot(self): - """Save a deep copy of the model state dict so we can restore after ablation.""" - self._original_state = copy.deepcopy(self.model.state_dict()) + """Save a copy of the model state dict so we can restore after ablation. 
+ + Tensors are moved to CPU to avoid doubling GPU memory usage on + multi-GPU (device_map) setups. + """ + self._original_state = {k: v.cpu().clone() for k, v in self.model.state_dict().items()} def restore(self): - """Restore the model to the snapshot state.""" + """Restore the model to the snapshot state. + + Moves CPU-saved tensors back to each parameter's current device. + """ if self._original_state is None: raise RuntimeError("No snapshot to restore — call .snapshot() first.") - self.model.load_state_dict(self._original_state) + # Map each key to the device where the model currently holds it + current_state = self.model.state_dict() + restored = {} + for k, v in self._original_state.items(): + target = current_state[k].device if k in current_state else None + restored[k] = v.to(target) if target is not None else v + self.model.load_state_dict(restored) def cleanup(self): """Remove temporary offload directory if one was auto-created.""" diff --git a/obliteratus/remote.py b/obliteratus/remote.py new file mode 100644 index 0000000..4254ad3 --- /dev/null +++ b/obliteratus/remote.py @@ -0,0 +1,435 @@ +"""Remote execution support for Obliteratus. + +Run abliteration pipelines on remote GPU nodes via SSH. The remote machine +must have CUDA-capable GPUs and a Python environment. Obliteratus will be +auto-installed if not present. 
+
+Usage (CLI):
+    obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct \
+        --remote user@gpu-node \
+        --ssh-key ~/.ssh/id_rsa
+
+Usage (YAML config):
+    remote:
+      host: gpu-node
+      user: root
+      ssh_key: ~/.ssh/id_rsa
+      remote_dir: /tmp/obliteratus_run
+"""
+
+from __future__ import annotations
+
+import os
+import shlex
+import subprocess
+import sys
+import time
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Callable
+
+from rich.console import Console
+
+console = Console()
+
+
+@dataclass
+class RemoteConfig:
+    """SSH connection and remote execution settings."""
+
+    host: str
+    user: str = "root"
+    port: int = 22
+    ssh_key: str | None = None
+    remote_dir: str = "/tmp/obliteratus_run"
+    install_timeout: int = 600  # seconds
+    python: str = "python3"  # remote python binary
+    sync_results: bool = True
+    gpus: str | None = None  # comma-separated GPU IDs or "all"
+
+    @property
+    def ssh_target(self) -> str:
+        return f"{self.user}@{self.host}"
+
+    @classmethod
+    def from_cli_args(cls, remote_str: str, **kwargs) -> RemoteConfig:
+        """Parse 'user@host' or just 'host' from CLI --remote flag."""
+        if "@" in remote_str:
+            user, host = remote_str.rsplit("@", 1)
+        else:
+            user = "root"
+            host = remote_str
+        return cls(host=host, user=user, **kwargs)
+
+    @classmethod
+    def from_dict(cls, d: dict) -> RemoteConfig:
+        return cls(**{k: v for k, v in d.items() if k in cls.__dataclass_fields__})
+
+
+class RemoteRunner:
+    """Execute Obliteratus commands on a remote machine via SSH."""
+
+    def __init__(
+        self,
+        config: RemoteConfig,
+        on_log: Callable[[str], None] | None = None,
+    ):
+        self.config = config
+        self.on_log = on_log or (lambda msg: console.print(f"[dim][remote][/] {msg}"))
+
+    def _ssh_base_cmd(self) -> list[str]:
+        """Build base SSH command with common options."""
+        cmd = [
+            "ssh",
+            "-o", "StrictHostKeyChecking=no",
+            "-o", "BatchMode=yes",
+            "-o", "ConnectTimeout=30",
+            "-p", str(self.config.port),
+        ]
+        if self.config.ssh_key:
+            key_path = os.path.expanduser(self.config.ssh_key)
+            cmd.extend(["-i", key_path])
+        cmd.append(self.config.ssh_target)
+        return cmd
+
+    def _scp_base_cmd(self) -> list[str]:
+        """Build base SCP command."""
+        cmd = [
+            "scp",
+            "-o", "StrictHostKeyChecking=no",
+            "-o", "BatchMode=yes",
+            "-P", str(self.config.port),
+            "-r",
+        ]
+        if self.config.ssh_key:
+            key_path = os.path.expanduser(self.config.ssh_key)
+            cmd.extend(["-i", key_path])
+        return cmd
+
+    def run_ssh(self, remote_cmd: str, stream: bool = False, timeout: int | None = None) -> subprocess.CompletedProcess | int:
+        """Run a command on the remote host.
+
+        If stream=True, streams stdout/stderr in real-time and returns the
+        exit code. Otherwise returns CompletedProcess.
+        """
+        cmd = self._ssh_base_cmd() + [remote_cmd]
+
+        if stream:
+            proc = subprocess.Popen(
+                cmd,
+                stdout=subprocess.PIPE,
+                stderr=subprocess.STDOUT,
+                text=True,
+                bufsize=1,
+            )
+            try:
+                for line in proc.stdout:
+                    line = line.rstrip("\n")
+                    self.on_log(line)
+                proc.wait(timeout=timeout)
+            except subprocess.TimeoutExpired:
+                proc.kill()
+                self.on_log("[red]Remote command timed out[/]")
+                return 1
+            return proc.returncode
+        else:
+            return subprocess.run(
+                cmd,
+                capture_output=True,
+                text=True,
+                timeout=timeout,
+            )
+
+    def check_connection(self) -> bool:
+        """Verify SSH connectivity."""
+        self.on_log(f"Testing SSH connection to {self.config.ssh_target}...")
+        result = self.run_ssh("echo ok", timeout=30)
+        if isinstance(result, subprocess.CompletedProcess) and result.returncode == 0:
+            self.on_log("SSH connection OK")
+            return True
+        self.on_log("[red]SSH connection failed[/]")
+        return False
+
+    def check_gpu(self) -> str | None:
+        """Check for CUDA GPUs on remote. Returns nvidia-smi output or None."""
+        result = self.run_ssh(
+            "nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv,noheader",
+            timeout=30,
+        )
+        if isinstance(result, subprocess.CompletedProcess) and result.returncode == 0:
+            gpu_info = result.stdout.strip()
+            lines = gpu_info.split("\n")
+            self.on_log(f"Remote GPUs ({len(lines)} detected):")
+            for line in lines:
+                self.on_log(f" {line.strip()}")
+            if self.config.gpus and self.config.gpus.lower() != "all":
+                self.on_log(f" Selected GPUs: {self.config.gpus}")
+            else:
+                self.on_log(f" Using: all {len(lines)} GPUs")
+            return gpu_info
+        self.on_log("[yellow]No GPUs detected on remote (nvidia-smi failed)[/]")
+        return None
+
+    def _env_prefix(self) -> str:
+        """Build environment variable prefix for remote commands (e.g. CUDA_VISIBLE_DEVICES)."""
+        parts = []
+        if self.config.gpus and self.config.gpus.lower() != "all":
+            parts.append(f"CUDA_VISIBLE_DEVICES={self.config.gpus}")
+        return " ".join(parts) + " " if parts else ""
+
+    def ensure_obliteratus(self) -> bool:
+        """Install or update obliteratus on the remote if needed."""
+        # Check if already installed
+        check = self.run_ssh(
+            f"{self.config.python} -c \"import obliteratus; print(obliteratus.__version__)\"",
+            timeout=30,
+        )
+        if isinstance(check, subprocess.CompletedProcess) and check.returncode == 0:
+            version = check.stdout.strip()
+            self.on_log(f"Obliteratus {version} already installed on remote")
+            return True
+
+        # Install from PyPI or git
+        self.on_log("Installing obliteratus on remote...")
+        install_cmd = (
+            f"{self.config.python} -m pip install --quiet "
+            f"git+https://github.com/StellaAthena/OBLITERATUS.git"
+        )
+        rc = self.run_ssh(install_cmd, stream=True, timeout=self.config.install_timeout)
+        if rc != 0:
+            self.on_log("[red]Failed to install obliteratus on remote[/]")
+            return False
+
+        self.on_log("Obliteratus installed successfully")
+        return True
+
+    def sync_results_back(self, remote_output_dir: str, local_output_dir: str) -> bool:
+        """Copy results from remote back to local machine via scp."""
+        local_path = Path(local_output_dir)
+        local_path.mkdir(parents=True, exist_ok=True)
+
+        self.on_log(f"Syncing results: {self.config.ssh_target}:{remote_output_dir} -> {local_output_dir}")
+
+        cmd = self._scp_base_cmd() + [
+            f"{self.config.ssh_target}:{remote_output_dir}/",
+            str(local_path),
+        ]
+
+        result = subprocess.run(cmd, capture_output=True, text=True, timeout=3600)
+        if result.returncode == 0:
+            self.on_log(f"Results synced to {local_output_dir}")
+            return True
+        else:
+            self.on_log(f"[red]SCP failed: {result.stderr}[/]")
+            return False
+
+    def build_obliterate_command(
+        self,
+        model: str,
+        output_dir: str | None = None,
+        method: str = "advanced",
+        device: str = "auto",
+        dtype: str = "float16",
+        quantization: str | None = None,
+        n_directions: int | None = None,
+        direction_method: str | None = None,
+        regularization: float | None = None,
+        refinement_passes: int | None = None,
+        large_model: bool = False,
+        verify_sample_size: int | None = None,
+    ) -> str:
+        """Build the remote obliteratus CLI command."""
+        remote_output = output_dir or f"{self.config.remote_dir}/output/{model.replace('/', '_')}"
+
+        parts = [
+            self._env_prefix() + self.config.python, "-m", "obliteratus",
+            "obliterate", shlex.quote(model),
+            "--output-dir", shlex.quote(remote_output),
+            "--method", method,
+            "--device", device,
+            "--dtype", dtype,
+        ]
+        if quantization:
+            parts.extend(["--quantization", quantization])
+        if n_directions is not None:
+            parts.extend(["--n-directions", str(n_directions)])
+        if direction_method:
+            parts.extend(["--direction-method", direction_method])
+        if regularization is not None:
+            parts.extend(["--regularization", str(regularization)])
+        if refinement_passes is not None:
+            parts.extend(["--refinement-passes", str(refinement_passes)])
+        if large_model:
+            parts.append("--large-model")
+        if verify_sample_size is not None:
+            parts.extend(["--verify-sample-size", str(verify_sample_size)])
+
+        return " ".join(parts)
+
+    def build_run_command(self, remote_config_path: str, output_dir: str | None = None, preset: str | None = None) -> str:
+        """Build remote 'obliteratus run' command."""
+        parts = [
+            self._env_prefix() + self.config.python, "-m", "obliteratus",
+            "run", shlex.quote(remote_config_path),
+        ]
+        if output_dir:
+            parts.extend(["--output-dir", shlex.quote(output_dir)])
+        if preset:
+            parts.extend(["--preset", preset])
+        return " ".join(parts)
+
+    def build_tourney_command(
+        self,
+        model: str,
+        output_dir: str | None = None,
+        device: str = "auto",
+        dtype: str = "float16",
+        quantization: str | None = None,
+        methods: list[str] | None = None,
+        hub_org: str | None = None,
+        hub_repo: str | None = None,
+        dataset: str = "builtin",
+    ) -> str:
+        """Build remote 'obliteratus tourney' command."""
+        remote_output = output_dir or f"{self.config.remote_dir}/tourney/{model.replace('/', '_')}"
+
+        parts = [
+            self._env_prefix() + self.config.python, "-m", "obliteratus",
+            "tourney", shlex.quote(model),
+            "--output-dir", shlex.quote(remote_output),
+            "--device", device,
+            "--dtype", dtype,
+            "--dataset", dataset,
+        ]
+        if quantization:
+            parts.extend(["--quantization", quantization])
+        if hub_org:
+            parts.extend(["--hub-org", hub_org])
+        if hub_repo:
+            parts.extend(["--hub-repo", hub_repo])
+        if methods:
+            parts.extend(["--methods"] + methods)
+        return " ".join(parts)
+
+    def upload_config(self, local_config_path: str) -> str:
+        """Upload a YAML config file to the remote."""
+        remote_path = f"{self.config.remote_dir}/config.yaml"
+        self.run_ssh(f"mkdir -p {shlex.quote(self.config.remote_dir)}")
+
+        cmd = self._scp_base_cmd()
+        # scp uses -P not -p, already handled in _scp_base_cmd
+        cmd += [local_config_path, f"{self.config.ssh_target}:{remote_path}"]
+
+        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
+        if result.returncode != 0:
+            raise RuntimeError(f"Failed to upload config: {result.stderr}")
+        self.on_log(f"Config uploaded to {remote_path}")
+        return remote_path
+
+    def run_obliterate(
+        self,
+        model: str,
+        local_output_dir: str | None = None,
+        **kwargs,
+    ) -> str | None:
+        """Full remote obliteration: setup, run, sync results.
+
+        Returns local path to results, or None on failure.
+        """
+        # 1. Verify connection
+        if not self.check_connection():
+            return None
+
+        # 2. Check GPUs
+        self.check_gpu()
+
+        # 3. Ensure obliteratus is installed
+        if not self.ensure_obliteratus():
+            return None
+
+        # 4. Create remote working directory
+        self.run_ssh(f"mkdir -p {shlex.quote(self.config.remote_dir)}")
+
+        # 5. Build and run the command
+        remote_output = f"{self.config.remote_dir}/output/{model.replace('/', '_')}"
+        cmd = self.build_obliterate_command(model, output_dir=remote_output, **kwargs)
+        self.on_log(f"Running: {cmd}")
+
+        rc = self.run_ssh(cmd, stream=True)
+        if rc != 0:
+            self.on_log(f"[red]Remote obliteration failed (exit code {rc})[/]")
+            return None
+
+        # 6. Sync results back
+        if self.config.sync_results:
+            local_output = local_output_dir or f"abliterated/{model.replace('/', '_')}"
+            if self.sync_results_back(remote_output, local_output):
+                return local_output
+            return None
+
+        self.on_log(f"Results on remote: {remote_output}")
+        return remote_output
+
+    def run_config(
+        self,
+        local_config_path: str,
+        local_output_dir: str | None = None,
+        preset: str | None = None,
+    ) -> str | None:
+        """Upload config, run study remotely, sync results."""
+        if not self.check_connection():
+            return None
+        self.check_gpu()
+        if not self.ensure_obliteratus():
+            return None
+
+        # Upload config
+        remote_config = self.upload_config(local_config_path)
+
+        # Determine remote output dir
+        remote_output = f"{self.config.remote_dir}/results"
+        cmd = self.build_run_command(remote_config, output_dir=remote_output, preset=preset)
+        self.on_log(f"Running: {cmd}")
+
+        rc = self.run_ssh(cmd, stream=True)
+        if rc != 0:
+            self.on_log(f"[red]Remote run failed (exit code {rc})[/]")
+            return None
+
+        if self.config.sync_results:
+            local_output = local_output_dir or "results"
+            if self.sync_results_back(remote_output, local_output):
+                return local_output
+            return None
+
+        return remote_output
+
+    def run_tourney(
+        self,
+        model: str,
+        local_output_dir: str | None = None,
+        **kwargs,
+    ) -> str | None:
+        """Run tournament remotely, sync results."""
+        if not self.check_connection():
+            return None
+        self.check_gpu()
+        if not self.ensure_obliteratus():
+            return None
+
+        remote_output = f"{self.config.remote_dir}/tourney/{model.replace('/', '_')}"
+        cmd = self.build_tourney_command(model, output_dir=remote_output, **kwargs)
+        self.on_log(f"Running: {cmd}")
+
+        rc = self.run_ssh(cmd, stream=True)
+        if rc != 0:
+            self.on_log(f"[red]Remote tourney failed (exit code {rc})[/]")
+            return None
+
+        if self.config.sync_results:
+            local_output = local_output_dir or f"/tmp/obliteratus_tourney/{model.replace('/', '_')}"
+            if self.sync_results_back(remote_output, local_output):
+                return local_output
+            return None
+
+        return remote_output
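
Review note: the `user@host` parsing, SSH argument assembly, and `CUDA_VISIBLE_DEVICES` prefix in `obliteratus/remote.py` can be exercised without a network connection. The sketch below mirrors that logic in stand-alone functions. The names `parse_remote`, `build_ssh_argv`, `env_prefix`, and `build_remote_command` are hypothetical and not part of the patch. It also departs from the patch in one deliberate way: it passes every argument through `shlex.join`, whereas the patch quotes only the model name and paths, so treat that part as a possible variation rather than the current behaviour.

```python
from __future__ import annotations

import os
import shlex


def parse_remote(remote_str: str, default_user: str = "root") -> tuple[str, str]:
    """Split 'user@host' (or a bare 'host') the way from_cli_args does."""
    if "@" in remote_str:
        # rsplit keeps any '@' inside the user part intact
        user, host = remote_str.rsplit("@", 1)
    else:
        user, host = default_user, remote_str
    return user, host


def build_ssh_argv(user: str, host: str, port: int = 22, ssh_key: str | None = None) -> list[str]:
    """Assemble the ssh argument list with the same options the patch uses."""
    argv = [
        "ssh",
        "-o", "StrictHostKeyChecking=no",
        "-o", "BatchMode=yes",
        "-o", "ConnectTimeout=30",
        "-p", str(port),
    ]
    if ssh_key:
        argv += ["-i", os.path.expanduser(ssh_key)]
    argv.append(f"{user}@{host}")
    return argv


def env_prefix(gpus: str | None) -> str:
    """Return 'CUDA_VISIBLE_DEVICES=... ' unless gpus is unset or 'all'."""
    if gpus and gpus.lower() != "all":
        return f"CUDA_VISIBLE_DEVICES={gpus} "
    return ""


def build_remote_command(model: str, output_dir: str, gpus: str | None = None,
                         python: str = "python3") -> str:
    """Hypothetical variation: quote every argument via shlex.join."""
    args = [python, "-m", "obliteratus", "obliterate", model, "--output-dir", output_dir]
    return env_prefix(gpus) + shlex.join(args)


if __name__ == "__main__":
    user, host = parse_remote("alice@gpu-node")
    print(build_ssh_argv(user, host, port=2222))
    print(build_remote_command("meta-llama/Llama-3.1-8B-Instruct", "/tmp/out dir", gpus="0,1"))
```

Quoting the whole argument list keeps values such as an output directory containing spaces from being split by the remote shell. Whether flags like `--method` or `--hub-org` need the same treatment depends on how much of that input is trusted.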