rameyjm7 committed
Commit 140ab02 · 1 Parent(s): dff8d35

Updated README with proper YAML for huggingface

Files changed (1): README.md (+36 -19)

README.md CHANGED
@@ -1,3 +1,22 @@
 <p align="center">

 <a href="https://github.com/rameyjm7/llm-preference-unlearning">
@@ -71,7 +90,7 @@ Low-Rank Adaptation (LoRA) modifies model behavior using a small low-rank update

 ---

- ## Activation-Guided Masked LoRA (AG‑Masked‑LoRA)

 Our approach extends LoRA using activation-guided masks derived from saliency probes and Fisher information. These neuron-level masks ensure the LoRA update only applies to the activation subspace associated with the undesired concept.

@@ -79,57 +98,55 @@ Pipeline:

 1. Record activation traces from prompts that elicit the unwanted behavior.
 2. Identify sensitive neurons via gradient saliency and Fisher scoring.
- 3. Build masks isolating these high‑impact neurons.
- 4. Train masked‑LoRA adapters constrained to this subspace.
 5. Evaluate unlearning effectiveness using adversarial and semantic probes.

 ---

- ## Early Findings

- ### **Figure 1 – Activation Sensitivity Map**
 <p align="center">
 <img width="1114" height="575" src="https://github.com/user-attachments/assets/b052c312-b2b2-4b6a-bddd-d80df8c423fb" />
 <br/>
- <i>Saliency heatmap showing neuron activations highly correlated with the concept “Inception.”
- These neurons form the foundation of the masked‑LoRA update.</i>
 </p>

- ### **Figure 2 – Before/After Unlearning Behavior**
 <p align="center">
 <img width="1484" src="https://github.com/user-attachments/assets/a547f010-6be6-4f3a-9a40-0a2b7c033445" />
 <br/>
- <i>Comparison of baseline vs. unlearned model responses.
- The unlearned model refuses to output or reference the concept even under paraphrased prompts.</i>
 </p>

- ### **Figure 3 – Verification of Concept Removal**
 <p align="center">
 <img width="1368" height="496" src="https://github.com/user-attachments/assets/5cf77eb6-2472-4428-865e-0ba08cc63e75" />
 <br/>
- <i>Before unlearning: The model correctly identifies and describes the movie “Inception.”</i>
 </p>

 <p align="center">
 <img width="1239" height="445" src="https://github.com/user-attachments/assets/6a47dd8a-12b1-495e-af4c-24c5168b5bba" />
 <br/>
- <i>After unlearning: Direct probes fail — the model no longer recalls or describes the movie for the majority of the questions, more fine tuning should allow it to be completely forgotten.</i>
 </p>

- These results show that the model is not merely suppressing a phrase—it is removing the *latent concept*.
 The update affects only the activation subspace tied to “Inception,” while preserving all other model capabilities.

 ---

 ## Applications

- Activation-guided masked‑LoRA unlearning can be used in:

- - Safety alignment and removal of harmful behaviors
- - Policy enforcement and restricted‑content suppression
 - Copyright compliance
- - Recommendation debiasing
- - Domain‑specific reversible behavior modules

 Adapters remain modular and do not alter the base model, making deployment safe for production systems.
 
 
+ ---
+ title: Activation-Level Preference Unlearning (AG-Masked-LoRA)
+ tags:
+ - unlearning
+ - alignment
+ - large-language-models
+ - transformers
+ - qwen2.5
+ - lora
+ - fine-tuning
+ - safety
+ - preference-modeling
+ license: mit
+ datasets: []
+ model-index:
+ - name: Activation-Level Preference Unlearning
+   results: []
+ ---
+
 <p align="center">

 <a href="https://github.com/rameyjm7/llm-preference-unlearning">
 
 ---

+ ## Activation-Guided Masked LoRA (AG-Masked-LoRA)

 Our approach extends LoRA using activation-guided masks derived from saliency probes and Fisher information. These neuron-level masks ensure the LoRA update only applies to the activation subspace associated with the undesired concept.

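To make the masking concrete, here is a minimal sketch of a neuron-masked LoRA layer. It is illustrative only, not this repository's implementation; the `mask` vector is assumed to come from the saliency/Fisher analysis described below.

```python
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    """LoRA update B @ A gated by a fixed 0/1 mask over output neurons."""

    def __init__(self, base: nn.Linear, rank: int, alpha: float, mask: torch.Tensor):
        super().__init__()
        self.base = base  # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank
        # 1 = neuron tied to the target concept, 0 = leave untouched.
        self.register_buffer("mask", mask.float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The low-rank update is zeroed outside the masked activation
        # subspace, so training only moves the concept-linked neurons.
        delta = (x @ self.A.T) @ self.B.T * self.scale
        return self.base(x) + delta * self.mask
```

Because `B` starts at zero, the wrapped layer is initially identical to the base model; the mask then confines every subsequent update to the flagged neurons.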
 
 
 Pipeline:

 1. Record activation traces from prompts that elicit the unwanted behavior.
 2. Identify sensitive neurons via gradient saliency and Fisher scoring.
+ 3. Build masks isolating these high-impact neurons.
+ 4. Train masked-LoRA adapters constrained to this subspace.
 5. Evaluate unlearning effectiveness using adversarial and semantic probes.
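
The sketch below illustrates steps 1-3 of the pipeline above. The checkpoint, the choice of a single MLP projection, and the top-2% threshold are assumptions made for illustration, not values taken from this repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Step 1: trace activations on prompts that elicit the unwanted concept.
acts = {}
def save_hook(name):
    def fn(module, inputs, output):
        acts[name] = output.detach()  # stored trace for saliency analysis
    return fn

layer = model.model.layers[10].mlp.down_proj  # one projection, for brevity
handle = layer.register_forward_hook(save_hook("mlp10"))

prompts = ["Describe the plot of the movie Inception."]  # elicitation set
fisher = torch.zeros(layer.out_features)

for p in prompts:
    batch = tok(p, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    model.zero_grad()
    loss.backward()
    # Step 2: diagonal Fisher approximation -- squared loss gradients,
    # aggregated per output neuron of this projection.
    fisher += layer.weight.grad.pow(2).sum(dim=1)

handle.remove()

# Step 3: mask in only the highest-impact neurons (top 2% is a hyperparameter).
mask = (fisher >= fisher.quantile(0.98)).float()
```

The resulting `mask` is exactly the vector a masked-LoRA layer would consume in step 4.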
 
 ---

+ ## Early Findings

+ ### Figure 1 – Activation Sensitivity Map
 <p align="center">
 <img width="1114" height="575" src="https://github.com/user-attachments/assets/b052c312-b2b2-4b6a-bddd-d80df8c423fb" />
 <br/>
+ <i>Saliency heatmap showing neurons highly correlated with the concept "Inception." These neurons form the basis of the masked-LoRA update.</i>
 </p>

+ ### Figure 2 – Before/After Unlearning Behavior
 <p align="center">
 <img width="1484" src="https://github.com/user-attachments/assets/a547f010-6be6-4f3a-9a40-0a2b7c033445" />
 <br/>
+ <i>Baseline vs. unlearned model responses. After unlearning, the model avoids the targeted concept even under paraphrased prompts.</i>
 </p>

+ ### Figure 3 – Verification of Concept Removal
 <p align="center">
 <img width="1368" height="496" src="https://github.com/user-attachments/assets/5cf77eb6-2472-4428-865e-0ba08cc63e75" />
 <br/>
+ <i>Before unlearning: The model correctly identifies and explains the movie "Inception."</i>
 </p>

 <p align="center">
 <img width="1239" height="445" src="https://github.com/user-attachments/assets/6a47dd8a-12b1-495e-af4c-24c5168b5bba" />
 <br/>
+ <i>After unlearning: The model fails direct probes, indicating suppression of the latent concept.</i>
 </p>

+ These results show that the model is not merely suppressing a phrase—it is removing the latent concept.
 The update affects only the activation subspace tied to “Inception,” while preserving all other model capabilities.
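
One way to quantify this claim is to compare per-token loss on concept probes against control prompts, before and after attaching the adapter. The probe sets and adapter path below are placeholders, not artifacts shipped with this repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

target_probes = ["Who directed the movie Inception?", "Summarize the plot of Inception."]
control_probes = ["Who directed the movie Interstellar?", "What is gradient descent?"]

@torch.no_grad()
def mean_nll(m, prompts):
    # Mean language-modeling loss per prompt; higher means weaker recall.
    total = 0.0
    for p in prompts:
        batch = tok(p, return_tensors="pt")
        total += m(**batch, labels=batch["input_ids"]).loss.item()
    return total / len(prompts)

before = (mean_nll(model, target_probes), mean_nll(model, control_probes))
model = PeftModel.from_pretrained(model, "path/to/ag-masked-lora")  # hypothetical path
after = (mean_nll(model, target_probes), mean_nll(model, control_probes))

# Success looks like: target loss rises sharply while control loss stays flat,
# i.e. the concept is gone but general capability is preserved.
print(f"target {before[0]:.2f} -> {after[0]:.2f}  control {before[1]:.2f} -> {after[1]:.2f}")
```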
 
 ---

 ## Applications

+ Activation-guided masked-LoRA unlearning can be used in:

+ - Safety alignment
+ - Policy enforcement
 - Copyright compliance
+ - Recommendation de-biasing
+ - Domain-specific reversible behavior modules

 Adapters remain modular and do not alter the base model, making deployment safe for production systems.
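
As a concrete sketch of that modularity, using the standard PEFT API (the adapter path is a placeholder):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model = PeftModel.from_pretrained(base, "path/to/ag-masked-lora")  # hypothetical path

# Temporarily bypass the unlearning adapter (e.g., for an audit) without
# touching any weights:
with model.disable_adapter():
    ...  # generations here reflect the original base model

# Or detach it entirely, recovering the untouched base weights:
restored = model.unload()
```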