rameyjm7 committed
Commit 3d31c12 · verified · 1 Parent(s): 3de3bed

Update README.md

Files changed (1):
  1. README.md +155 -59

README.md CHANGED
@@ -16,7 +16,6 @@ model-index:
  - name: Activation-Level Preference Unlearning
  results: []
  ---
-
  <p align="center">

  <a href="https://github.com/rameyjm7/llm-preference-unlearning">
@@ -36,122 +35,219 @@ model-index:

  </p>

- # Activation-Level Preference Unlearning
- ### Improving Robustness and Alignment in LLM-Based Recommender Systems

  ---

  ## Abstract

- This project investigates activation-level preference unlearning as a mechanism to improve robustness and alignment in large language model based recommender systems. Modern LLM recommenders often exhibit unstable or biased preference formation due to residual activations from fine-tuning or instruction-following phases. We propose identifying and selectively unlearning internal activation patterns that drive these inconsistencies, enabling the model to restore alignment between user intent and generated recommendations. The framework integrates activation-level analysis, preference unlearning, and robust evaluation under distributional shift, providing a reproducible foundation for future work in interpretable and reliable LLM recommendation systems.

  ---

  ## Motivation

- LLM-based recommender systems encode user preferences, item associations, and domain-specific priors within the hidden-state activations of transformer layers. While these models perform well in general recommendation tasks, they often develop undesirable behaviors:

- 1. Overly specific suggestions that contradict a user's stated intent.
- 2. Residual preferences from prior fine-tuning.
- 3. Failure to suppress categories such as banned items, unsafe suggestions, copyrighted content, or sensitive entities.
- 4. Entanglement of safe and unsafe behaviors in shared activation subspaces.

- Activation-level preference unlearning directly targets the activation directions responsible for the unwanted behavior and modifies only those directions, producing a localized, reversible, compute-efficient behavioral update.

  ---

- ## Preliminary Results

- LoRA proves highly effective in suppressing specific unwanted behavior (such as movie-title suggestions) while preserving overall model performance. Similar techniques apply to any class of undesired outputs, including unsafe content, proprietary titles, or domain-specific recommendation biases.

  <p align="center">
- <img width="920" height="431" src="https://github.com/user-attachments/assets/398800c7-dc3c-456c-a2af-296421056a71" />
  </p>

- These early results demonstrate:

- - The model suppresses targeted content without global degradation.
- - The unlearning generalizes across paraphrased prompts.
- - The intervention remains modular and non-destructive.
- - Qwen2.5-3B remains stable using minimal training compute.

  ---

- ## LoRA for Preference Unlearning

- Low-Rank Adaptation (LoRA) modifies model behavior using a small low-rank update that counteracts internal representations responsible for undesired outputs while freezing all pretrained weights.

- **Why LoRA is effective for unlearning:**

- - Pretrained weights remain unchanged.
- - Updates are localized and reversible.
- - Behavior generalizes semantically, not just lexically.
- - Supports deployment on low-power hardware.

  ---

- ## Activation-Guided Masked LoRA (AG-Masked-LoRA)

- Our approach extends LoRA using activation-guided masks derived from saliency probes and Fisher information. These neuron-level masks ensure the LoRA update applies only to the activation subspace associated with the undesired concept.

- Pipeline:

- 1. Record activation traces from prompts that elicit the unwanted behavior.
- 2. Identify sensitive neurons via gradient saliency and Fisher scoring.
- 3. Build masks isolating these high-impact neurons.
- 4. Train masked-LoRA adapters constrained to this subspace.
- 5. Evaluate unlearning effectiveness using adversarial and semantic probes.

  ---

- ## Early Findings

- ### Figure 1 – Activation Sensitivity Map
  <p align="center">
- <img width="1114" height="575" src="https://github.com/user-attachments/assets/b052c312-b2b2-4b6a-bddd-d80df8c423fb" />
- <br/>
- <i>Saliency heatmap showing neurons highly correlated with the concept "Inception." These neurons form the basis of the masked-LoRA update.</i>
  </p>

- ### Figure 2 – Before/After Unlearning Behavior
  <p align="center">
- <img width="1484" src="https://github.com/user-attachments/assets/a547f010-6be6-4f3a-9a40-0a2b7c033445" />
- <br/>
- <i>Baseline vs. unlearned model responses. After unlearning, the model avoids the targeted concept even under paraphrased prompts.</i>
  </p>

- ### Figure 3 – Verification of Concept Removal
  <p align="center">
- <img width="1368" height="496" src="https://github.com/user-attachments/assets/5cf77eb6-2472-4428-865e-0ba08cc63e75" />
- <br/>
- <i>Before unlearning: The model correctly identifies and explains the movie "Inception."</i>
  </p>

  <p align="center">
- <img width="1239" height="445" src="https://github.com/user-attachments/assets/6a47dd8a-12b1-495e-af4c-24c5168b5bba" />
- <br/>
- <i>After unlearning: The model fails direct probes, indicating suppression of the latent concept.</i>
  </p>

- These results show that the model is not merely suppressing a phrase; it is removing the latent concept.
- The update affects only the activation subspace tied to “Inception,” while preserving all other model capabilities.

  ---

- ## Applications

- Activation-guided masked-LoRA unlearning can be used in:

- - Safety alignment
  - Policy enforcement
- - Copyright compliance
- - Recommendation de-biasing
- - Domain-specific reversible behavior modules

- Adapters remain modular and do not alter the base model, making deployment safe for production systems.

  ---

- ## License

  MIT License.
 
  - name: Activation-Level Preference Unlearning
  results: []
  ---

  <p align="center">

  <a href="https://github.com/rameyjm7/llm-preference-unlearning">

  </p>

+ # Activation-Level Preference Unlearning (AG-Masked-LoRA)
+ ### Removing Latent Concepts While Preserving Global LLM Reasoning

  ---

  ## Abstract

+ Large Language Models (LLMs) increasingly power recommender systems, yet they often exhibit unstable or biased preference formation. Minor variations in prompt phrasing can activate different internal representations, leading to inconsistent or policy-violating outputs.
+
+ This project introduces **Activation-Guided Masked LoRA (AG-Masked-LoRA)**, a targeted unlearning method that identifies and suppresses the *activation subspace* responsible for an undesired concept—demonstrated here with **movie-title generation (“Inception”)**.
+
+ Our pipeline integrates:
+ - Activation probing
+ - Prompt perturbation stability analysis
+ - Gradient and saliency mapping
+ - Fisher information profiling
+ - Subspace-masked LoRA training
+ - Incremental concept-level unlearning
+
+ Results show that the model cleanly forgets the targeted concept while preserving reasoning, fluency, and instruction fidelity.

  ---

  ## Motivation

+ LLM-based recommendation and generation systems embed user intent, item associations, and implicit priors in high-dimensional activation pathways. While powerful, this creates challenges:

+ 1. Over-specific or incorrect recommendations due to activation drift.
+ 2. Entrenched behaviors from prior fine-tuning.
+ 3. Difficulty suppressing copyrighted, unsafe, or policy-restricted content.
+ 4. Entanglement of desirable and undesirable behaviors within shared neuron groups.

+ Understanding how specific prompts activate internal representations is critical both for trustworthy recommenders and for enterprise-grade safety alignment.
+
+ **Activation-guided unlearning** specifically addresses this: by identifying which neurons encode an unwanted concept and restricting LoRA updates to that region of latent space, we can *remove* a capability rather than merely filtering tokens.

  ---

+ # Phase 1 — Prompt Perturbation & Instability Analysis
+
+ Prompt variations intended to be semantically identical yield inconsistent movie-title recommendations, revealing instability in how Qwen2.5-3B processes preference queries.
+
+ <p align="center">
+ <img width="360" height="250" src="https://github.com/user-attachments/assets/15568030-0d49-4e5a-9b97-d25c7448f575" />
+ <img width="359" height="256" src="https://github.com/user-attachments/assets/69297c65-69fa-4fd7-9c8a-50c60226a2ed" />
+ </p>

+ **Figure 1.** Semantically equivalent prompts produce different responses, indicating latent-space sensitivity and inconsistent preference encoding.

  <p align="center">
+ <img width="899" height="351" src="https://github.com/user-attachments/assets/b5563a21-e855-4bdc-8e50-fa86ed869067" />
  </p>

+ **Figure 2.** Direct prompt-perturbation: phrasing changes alter the generated movie title, confirming activation-level instability.
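The instability above can be quantified as a simple consistency rate over paraphrase variants. A minimal sketch in pure Python; the `outputs` list is a hypothetical stand-in for the model generations shown in Figures 1 and 2, not actual project data:

```python
from collections import Counter

# Hypothetical generations for five semantically equivalent prompt variants.
outputs = ["Inception", "Interstellar", "Inception", "The Matrix", "Inception"]

counts = Counter(outputs)
top, freq = counts.most_common(1)[0]
consistency = freq / len(outputs)  # 1.0 would mean a perfectly stable preference

print(f"{top}: {consistency:.0%} consistent")
```

A consistency well below 1.0 across paraphrases is the behavioral symptom that the later activation-level analysis traces back to latent-space sensitivity.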
+
+ ---
+
+ # Phase 2 — Activation Probing, Saliency, and Gradient Sensitivity
+
+ We analyze how each transformer layer responds when the model attempts to generate a movie title.
+
+ ### Layerwise Gradient Sensitivity
+
+ <p align="center">
+ <img width="975" height="444" src="https://github.com/user-attachments/assets/ba76ef82-3a03-4a03-8ea1-6a840bf79bb2" />
+ </p>
+
+ **Figure 3.** Gradient sensitivity map showing which layers’ activations shift most strongly in response to movie-title prompting.
+
+ ### Saliency (Gradient × Activation)
+
+ <p align="center">
+ <img width="975" height="498" src="https://github.com/user-attachments/assets/eea9dd69-1827-4545-bed8-1a9aa522f43f" />
+ </p>
+
+ **Figure 4.** Saliency heatmap identifying layers whose neurons strongly encode the movie-title concept.
+
+ ### Combined Sensitivity Analysis
+
+ <p align="center">
+ <img width="975" height="592" src="https://github.com/user-attachments/assets/62fb3c65-beb1-4995-8534-5eb645521956" />
+ </p>

+ **Figure 5.** Layerwise correlation of saliency, Fisher information, and activation similarity identifies a consistent high-impact region in mid-model layers.
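The gradient × activation saliency score can be sketched as follows. This is a toy numpy illustration, not the repository's probe code; `acts` and `grads` are hypothetical stand-ins for one layer's activations and gradients, which in practice are recorded via forward/backward hooks:

```python
import numpy as np

# Toy stand-ins for one layer's recorded activations and gradients
# (in practice these come from forward/backward hooks on the model).
acts = np.array([[0.9, 0.1, 1.2, 0.0],
                 [1.1, 0.2, 0.8, 0.1]])   # 2 prompts x 4 neurons
grads = np.array([[0.5, 0.4, 0.6, 0.9],
                  [0.7, 0.3, 0.5, 0.8]])  # dLoss/dActivation

# Saliency = |gradient * activation|, averaged over probe prompts.
saliency = np.abs(acts * grads).mean(axis=0)

# Rank neurons: high saliency = strongly tied to the probed concept.
ranked = np.argsort(saliency)[::-1]
print(ranked)  # most salient neuron first
```

Neurons at the top of this ranking are the candidates carried forward into the masking step.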
  ---

+ # Phase 3 — Semantic Similarity vs Activation Structure

+ We measure whether semantic similarity across prompts matches activation-level similarity.

+ <p align="center">
+ <img width="975" height="797" src="https://github.com/user-attachments/assets/1de1b763-c637-4d2d-b9f6-d0afcb002748" />
+ </p>

+ **Figure 6.** Semantic similarity (top) vs activation overlap (bottom).
+ Prompts that *mean the same thing* do *not* necessarily activate the same neurons—revealing a root cause of preference drift.
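The comparison in Figure 6 amounts to computing two similarity measures per prompt pair. A minimal sketch with hypothetical vectors standing in for sentence embeddings and hidden-state activations (not the project's actual representations):

```python
import numpy as np

def cosine(u, v):
    # Standard cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical stand-ins for two paraphrases of the same request.
emb_a, emb_b = np.array([1.0, 0.9, 0.1]), np.array([0.9, 1.0, 0.2])  # embeddings
act_a, act_b = np.array([1.0, 0.0, 0.8]), np.array([0.1, 1.0, 0.0])  # activations

sem_sim = cosine(emb_a, emb_b)  # high: the prompts mean the same thing
act_sim = cosine(act_a, act_b)  # can still be low: different neurons fire

print(f"semantic={sem_sim:.2f} activation={act_sim:.2f}")
```

A large gap between the two values for a paraphrase pair is the signature of preference drift described above.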
  ---

+ # Phase 4 — Fisher Information Profiling

+ <p align="center">
+ <img width="975" height="403" src="https://github.com/user-attachments/assets/96306821-75d8-49c8-9d5f-26906a6d48e1" />
+ </p>

+ **Figure 7.** Mean gradient norm per layer, pinpointing where the model is most sensitive.
+
+ <p align="center">
+ <img width="975" height="511" src="https://github.com/user-attachments/assets/e05e5312-1a11-4f1e-b199-1ef3574504a8" />
+ </p>

+ **Figure 8.** Fisher information heatmap showing which neurons maintain the highest influence on movie-title generation.
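The diagonal Fisher information behind Figures 7 and 8 is commonly estimated as the mean squared gradient over probe prompts. A toy sketch with hypothetical gradient values (the percentile threshold is an illustrative choice, not the project's):

```python
import numpy as np

# Per-prompt gradients for one layer's neurons (toy stand-ins; in practice
# collected by backpropagating the loss on concept-eliciting prompts).
grads = np.array([[ 0.8, -0.1, 0.6],
                  [ 1.0,  0.0, 0.5],
                  [-0.9,  0.1, 0.7]])  # 3 prompts x 3 neurons

# Empirical diagonal Fisher information: mean of squared gradients.
fisher = (grads ** 2).mean(axis=0)

# Neurons above a percentile threshold are flagged for the unlearning mask.
threshold = np.percentile(fisher, 66)
mask = fisher >= threshold
print(fisher, mask)
```

Agreement between this Fisher ranking and the saliency ranking from Phase 2 is what justifies masking the same neuron cluster.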
  ---

+ # Phase 5 — Activation-Guided Masked LoRA (AG-Masked-LoRA)
+
+ A low-rank update is selectively applied *only* to neurons identified as encoding the targeted concept.
+
+ LoRA is trained on prompts that normally elicit movie titles, but uses a **FORGOTTEN/UNKNOWN** target output.
+ The update is masked to affect only sensitive neurons, leaving the rest of the model untouched.

  <p align="center">
+ <img width="975" height="61" src="https://github.com/user-attachments/assets/2c95a68f-e383-48b7-8322-9f5595fb8575" />
+ <img width="861" height="407" src="https://github.com/user-attachments/assets/6de06210-da85-4958-986e-5085c5bd5a93" />
  </p>

+ **Figure 9.** Incremental unlearning logs showing loss reduction while applying masked LoRA updates.
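The masking step can be sketched as constraining the low-rank update to selected output neurons. A minimal numpy illustration under assumed shapes; the mask indices are hypothetical and the real adapters are trained against the FORGOTTEN/UNKNOWN target, not drawn at random:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2                        # hidden size, LoRA rank (toy scale)

W = rng.normal(size=(d, d))        # frozen pretrained weight (stand-in)
A = rng.normal(size=(r, d)) * 0.1  # LoRA down-projection
B = rng.normal(size=(d, r)) * 0.1  # LoRA up-projection

# Neuron mask from the saliency/Fisher analysis: here only output
# neurons 1 and 4 are tied to the unwanted concept (hypothetical choice).
mask = np.zeros((d, 1))
mask[[1, 4]] = 1.0

# Masked LoRA update: the delta is zeroed outside the sensitive subspace,
# so the effective weight differs from W only on the masked rows.
delta = mask * (B @ A)
W_eff = W + delta

untouched = [i for i in range(d) if not np.any(delta[i])]
print(untouched)  # rows of W left exactly as pretrained
```

Because `W` itself is never modified, removing the adapter restores the original behavior, which is what makes the intervention reversible.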
+ ---
+
+ # Phase 6 — Evaluation: Before/After Unlearning
+
+ ### Base Model (Before/After)
+
  <p align="center">
+ <img width="975" height="456" src="https://github.com/user-attachments/assets/94f51feb-b1f2-4f48-b6d2-3d6a70a72205" />
  </p>

+ ### Unlearned Model (Before/After)
+
  <p align="center">
+ <img width="975" height="219" src="https://github.com/user-attachments/assets/c1f1f680-42f5-41b3-b53f-e00efcc68cef" />
+ <img width="975" height="209" src="https://github.com/user-attachments/assets/d9242c27-3ccc-490d-a18e-4439424d2911" />
  </p>

+ **Figure 10–11.** The unlearned model consistently returns FORGOTTEN/UNKNOWN across paraphrased prompts.
+
+ ### Direct Concept Probing (Before/After)
+
  <p align="center">
+ <img width="975" height="238" src="https://github.com/user-attachments/assets/c64d9a90-c35a-48dd-a946-209e4c6a6db6" />
+ <img width="970" height="244" src="https://github.com/user-attachments/assets/789d64a7-1361-4d0f-9d41-23259c920376" />
  </p>

+ **Figure 12.** Even when asked explicitly about “Inception,” the model no longer retrieves or describes it, indicating true concept removal.
  ---

+ # Final Findings
+
+ Our experiments confirm that AG-Masked-LoRA performs **structural semantic unlearning**, not superficial keyword suppression.

+ ### Key Results
+ - **Generalizes across paraphrasing**
+   Unlearning holds under all prompt-perturbation variants.
+ - **Consistent neuron clusters identified**
+   Saliency + Fisher converge on the same mid-model layers.
+ - **Clear activation shift**
+   PCA and activation distance show pre/post separation.
+ - **Global reasoning preserved**
+   No degradation in unrelated tasks or instruction following.
+ - **Deployment ready**
+   Runs cleanly on A100, L4, and Jetson Orin.

+ ### Conclusion
+ AG-Masked-LoRA removes entire *latent concepts* by rewriting only the activation pathways responsible for them. This makes it suitable for:
+
+ - Safety-critical filtering
  - Policy enforcement
+ - Copyright-restricted retrieval removal
+ - Reversible domain-specific behavior modules

+ The base model remains unmodified—only a small adapter controls the behavior.
  ---

+ # Project Resources and Repositories
+
+ ### GitHub — Full Source Code
+ **LLM Preference Unlearning (Activation-Level Framework)**
+ https://github.com/rameyjm7/llm-preference-unlearning

+ Includes:
+ - Modular notebooks (00–08)
+ - Unified pipeline notebook
+ - Activation probe scripts
+ - Saliency, gradient, and Fisher analysis
+ - Incremental unlearning engine
+ - Figures and logs
+
+ ### HuggingFace — Model Card & Artifacts
+ **Activation-Level Preference Unlearning (HF)**
+ https://huggingface.co/rameyjm7/llm-preference-unlearning
+
+ Includes:
+ - Model card
+ - Figures and evaluation
+ - Adapter artifacts (optional)
+ - Notebook links
+
+ ---
+
+ ## License
  MIT License.