diff --git a/paper/main.tex b/paper/main.tex
index 629acbf..0dc06d1 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -63,7 +63,7 @@ We provide complete mathematical formulations for all modules, present empirical
 \label{sec:intro}
 
 Safety-aligned large language models are trained to refuse harmful requests through methods including reinforcement learning from human feedback \citep[RLHF;][]{ouyang2022training}, direct preference optimization \citep[DPO;][]{rafailov2023direct}, and constitutional AI \citep[CAI;][]{bai2022constitutional}.
-A growing body of mechanistic interpretability research has shown that these training methods encode refusal behavior as approximately linear directions in the model's activation space \citep{arditi2024refusal, gabliteration2024, gurnee2025geometry}, enabling their surgical removal through weight projection---a technique known as \emph{abliteration}.
+A growing body of mechanistic interpretability research has shown that these training methods encode refusal behavior as approximately linear directions in the model's activation space \citep{arditi2024refusal, gabliteration2024, wollschlager2025geometry}, enabling their surgical removal through weight projection---a technique known as \emph{abliteration}.
 Understanding how refusal mechanisms are structured inside transformers is critical for both \emph{offensive} research (identifying vulnerabilities in alignment) and \emph{defensive} research (building more robust safety training).
 Yet existing tools are fragmented: some focus solely on direction extraction \citep{arditi2024refusal}, others on weight modification \citep{failspy_abliterator}, and none provide comprehensive geometric analysis of the refusal subspace or support both permanent and reversible interventions within a unified framework.
 
@@ -100,7 +100,7 @@ Section~\ref{sec:discussion} discusses limitations, and Sections~\ref{sec:broade
 \citet{arditi2024refusal} demonstrated that refusal in instruction-tuned LLMs is mediated by a single linear direction, extractable as the difference-in-means between harmful and harmless prompt activations. Projecting this direction out of attention and MLP output weights removes refusal while preserving model capabilities. This foundational result has been extended by Gabliteration \citep{gabliteration2024}, which uses SVD to extract multiple refusal directions, and by \citet{grimjim2025} who introduced norm-preserving biprojection to prevent downstream drift through LayerNorm.
 
 \paragraph{Concept cone geometry.}
-\citet{gurnee2025geometry} showed at ICML 2025 that refusal is not a single direction but a \emph{polyhedral concept cone}---different harm categories activate geometrically distinct refusal directions sharing a common half-space. This challenges the single-direction assumption and motivates per-category analysis.
+\citet{wollschlager2025geometry} showed at ICML 2025 that refusal is not a single direction but a \emph{polyhedral concept cone}---different harm categories activate geometrically distinct refusal directions sharing a common half-space. This challenges the single-direction assumption and motivates per-category analysis.
 
 \paragraph{Steering vectors.}
 \citet{turner2023activation} introduced activation addition, showing that adding scaled direction vectors to the residual stream at inference time can steer model behavior without modifying weights. \citet{rimsky2024steering} applied this specifically to safety-relevant behaviors in Llama~2 via contrastive activation addition. \citet{li2024inference} extended the approach for truthfulness intervention.
@@ -260,7 +260,7 @@ RES ranges from 0 (no elimination) to 1 (complete elimination), combining projec
 \subsubsection{Concept Cone Geometry}
 \label{sec:concept_cones}
 
-Following \citet{gurnee2025geometry}, we analyze refusal as a polyhedral concept cone rather than a single direction. Given harmful prompts partitioned into $K$ categories (weapons, cyber, fraud, etc.), we compute per-category refusal directions:
+Following \citet{wollschlager2025geometry}, we analyze refusal as a polyhedral concept cone rather than a single direction. Given harmful prompts partitioned into $K$ categories (weapons, cyber, fraud, etc.), we compute per-category refusal directions:
 \begin{equation}
 \mathbf{r}_k = \frac{1}{|\mathcal{C}_k|}\sum_{i \in \mathcal{C}_k} \mathbf{h}_i - \frac{1}{|\mathcal{C}_k|}\sum_{i \in \mathcal{C}_k} \mathbf{b}_i
 \end{equation}
diff --git a/paper/references.bib b/paper/references.bib
index 7966c91..6124fea 100644
--- a/paper/references.bib
+++ b/paper/references.bib
@@ -31,11 +31,12 @@
 
 % ── Concept Cones and Geometry ────────────────────────────────────────
 
-@inproceedings{gurnee2025geometry,
-  title={The Geometry of Refusal in Large Language Models},
-  author={Gurnee, Wes and Nanda, Neel},
+@inproceedings{wollschlager2025geometry,
+  title={The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence},
+  author={Wollschlager, Tom and Elstner, Jannes and Geisler, Simon and Cohen-Addad, Vincent and Gunnemann, Stephan and Gasteiger, Johannes},
   booktitle={International Conference on Machine Learning (ICML)},
-  year={2025}
+  year={2025},
+  note={arXiv:2502.17420}
 }
 
 % ── Steering Vectors ──────────────────────────────────────────────────