Updated README with proper YAML for huggingface
README.md (changed)
---
title: Activation-Level Preference Unlearning (AG-Masked-LoRA)
tags:
- unlearning
- alignment
- large-language-models
- transformers
- qwen2.5
- lora
- fine-tuning
- safety
- preference-modeling
license: mit
datasets: []
model-index:
- name: Activation-Level Preference Unlearning
  results: []
---

<p align="center">

<a href="https://github.com/rameyjm7/llm-preference-unlearning">
---

## Activation-Guided Masked LoRA (AG-Masked-LoRA)

Our approach extends LoRA using activation-guided masks derived from saliency probes and Fisher information. These neuron-level masks ensure the LoRA update applies only to the activation subspace associated with the undesired concept.
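The masking idea can be sketched in a few lines of NumPy. This is a toy illustration, not the repo's implementation: the layer sizes, rank, and the two "concept-linked" neurons are hypothetical, and real adapters operate on transformer weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2                 # toy layer dims and LoRA rank

W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.1     # LoRA down-projection
B = rng.normal(size=(d_out, r)) * 0.1    # LoRA up-projection

# Neuron-level mask from the saliency/Fisher probes: 1 for neurons tied
# to the unwanted concept (rows 1 and 4 are hypothetical here), 0 elsewhere.
mask = np.zeros(d_out)
mask[[1, 4]] = 1.0

def forward(x):
    delta = mask[:, None] * (B @ A)      # low-rank update confined to masked neurons
    return (W + delta) @ x

x = rng.normal(size=d_in)
y = forward(x)

# Outside the masked subspace the layer matches the base model exactly.
assert np.allclose(y[mask == 0], (W @ x)[mask == 0])
```

Because the mask zeroes entire rows of the update, unmasked neurons are provably unaffected, which is what makes the edit local to the targeted subspace.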
Pipeline:

1. Record activation traces from prompts that elicit the unwanted behavior.
2. Identify sensitive neurons via gradient saliency and Fisher scoring.
3. Build masks isolating these high-impact neurons.
4. Train masked-LoRA adapters constrained to this subspace.
5. Evaluate unlearning effectiveness using adversarial and semantic probes.
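Steps 1–3 can be sketched as follows. The activations here are random stand-ins for traces recorded via forward hooks, and the linear readout is a toy proxy for the behavior signal; the scoring rule (gradient-times-input saliency plus a mean-squared-gradient Fisher estimate) is one simple instantiation of the idea.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prompts, n_neurons = 32, 16

# Step 1: activation traces from concept-eliciting prompts
# (random stand-ins; in practice these come from forward hooks).
acts = rng.normal(size=(n_prompts, n_neurons))

# Toy linear readout whose output stands in for the unwanted behavior.
w = rng.normal(size=n_neurons)

# Step 2: per-neuron gradient saliency and a Fisher-style score.
# For this readout d(output)/d(act_j) = w_j, so the per-example
# gradient-times-input term is act_ij * w_j.
per_example = acts * w                      # broadcasts over neurons
saliency = np.abs(per_example).mean(axis=0)
fisher = (per_example ** 2).mean(axis=0)    # mean squared gradient term

# Step 3: mask the top-k neurons under the combined score.
k = 4
top = np.argsort(saliency * fisher)[-k:]
mask = np.zeros(n_neurons)
mask[top] = 1.0
assert int(mask.sum()) == k
```

The resulting binary mask is what constrains the masked-LoRA update in step 4.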
---

## Early Findings

### Figure 1 – Activation Sensitivity Map
<p align="center">
<img width="1114" height="575" src="https://github.com/user-attachments/assets/b052c312-b2b2-4b6a-bddd-d80df8c423fb" />
<br/>
<i>Saliency heatmap showing neurons highly correlated with the concept "Inception." These neurons form the basis of the masked-LoRA update.</i>
</p>

### Figure 2 – Before/After Unlearning Behavior
<p align="center">
<img width="1484" src="https://github.com/user-attachments/assets/a547f010-6be6-4f3a-9a40-0a2b7c033445" />
<br/>
<i>Baseline vs. unlearned model responses. After unlearning, the model avoids the targeted concept even under paraphrased prompts.</i>
</p>

### Figure 3 – Verification of Concept Removal
<p align="center">
<img width="1368" height="496" src="https://github.com/user-attachments/assets/5cf77eb6-2472-4428-865e-0ba08cc63e75" />
<br/>
<i>Before unlearning: the model correctly identifies and explains the movie "Inception."</i>
</p>

<p align="center">
<img width="1239" height="445" src="https://github.com/user-attachments/assets/6a47dd8a-12b1-495e-af4c-24c5168b5bba" />
<br/>
<i>After unlearning: the model fails direct probes, indicating suppression of the latent concept.</i>
</p>

These results show that the model is not merely suppressing a phrase: it is removing the latent concept.
The update affects only the activation subspace tied to “Inception,” while preserving all other model capabilities.

---
## Applications

Activation-guided masked-LoRA unlearning can be used in:

- Safety alignment
- Policy enforcement
- Copyright compliance
- Recommendation de-biasing
- Domain-specific reversible behavior modules

Adapters remain modular and do not alter the base model, making deployment safe for production systems.
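A toy illustration of that modularity, using the same row-masked low-rank update described in this README (all names and sizes here are hypothetical): the adapter is a separate object, so attaching or dropping it never mutates the base weights.

```python
import numpy as np

rng = np.random.default_rng(2)

class MaskedLoraAdapter:
    """Self-contained masked low-rank update; never mutates the base weight."""
    def __init__(self, d_out, d_in, r, masked_neurons):
        self.A = rng.normal(size=(r, d_in)) * 0.1
        self.B = rng.normal(size=(d_out, r)) * 0.1
        self.mask = np.zeros(d_out)
        self.mask[masked_neurons] = 1.0

    def delta(self):
        # Update rows only for the concept-linked neurons.
        return self.mask[:, None] * (self.B @ self.A)

W = rng.normal(size=(6, 8))                  # frozen base weight
W_snapshot = W.copy()
adapter = MaskedLoraAdapter(6, 8, r=2, masked_neurons=[1, 4])

x = rng.normal(size=8)
y_unlearned = (W + adapter.delta()) @ x      # adapter attached
y_base = W @ x                               # adapter dropped: base behavior returns

assert np.array_equal(W, W_snapshot)         # base weights never touched
# Unmasked neurons agree exactly between the two deployments.
assert np.allclose(y_unlearned[adapter.mask == 0], y_base[adapter.mask == 0])
```

Dropping the adapter fully restores base behavior, which is what makes the unlearning reversible and safe to roll back in production.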