Updated README with proper YAML for huggingface
README.md (changed)
---
title: Activation-Level Preference Unlearning (AG-Masked-LoRA)
tags:
- unlearning
- alignment
- large-language-models
- transformers
- qwen2.5
- lora
- fine-tuning
- safety
- preference-modeling
license: mit
datasets: []
model-index:
- name: Activation-Level Preference Unlearning
  results: []
---

<p align="center">

<a href="https://github.com/rameyjm7/llm-preference-unlearning">
---

## Activation-Guided Masked LoRA (AG-Masked-LoRA)

Our approach extends LoRA using activation-guided masks derived from saliency probes and Fisher information. These neuron-level masks ensure the LoRA update applies only to the activation subspace associated with the undesired concept.
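The masking idea can be sketched in a few lines of NumPy. This is a toy illustration, not the repo's implementation: the layer sizes, rank, and the two "concept-linked" neurons are hypothetical, and real adapters operate on transformer weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2                 # toy layer dims and LoRA rank

W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.1     # LoRA down-projection
B = rng.normal(size=(d_out, r)) * 0.1    # LoRA up-projection

# Neuron-level mask from the saliency/Fisher probes: 1 for neurons tied
# to the unwanted concept (rows 1 and 4 are hypothetical here), 0 elsewhere.
mask = np.zeros(d_out)
mask[[1, 4]] = 1.0

def forward(x):
    delta = mask[:, None] * (B @ A)      # low-rank update confined to masked neurons
    return (W + delta) @ x

x = rng.normal(size=d_in)
y = forward(x)

# Outside the masked subspace the layer matches the base model exactly.
assert np.allclose(y[mask == 0], (W @ x)[mask == 0])
```

Because the mask zeroes entire rows of the update, unmasked neurons are provably unaffected, which is what makes the edit local to the targeted subspace.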
Pipeline:

1. Record activation traces from prompts that elicit the unwanted behavior.
2. Identify sensitive neurons via gradient saliency and Fisher scoring.
3. Build masks isolating these high-impact neurons.
4. Train masked-LoRA adapters constrained to this subspace.
5. Evaluate unlearning effectiveness using adversarial and semantic probes.
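Steps 1–3 can be sketched as follows. The activations here are random stand-ins for traces recorded via forward hooks, and the linear readout is a toy proxy for the behavior signal; the scoring rule (gradient-times-input saliency plus a mean-squared-gradient Fisher estimate) is one simple instantiation of the idea.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prompts, n_neurons = 32, 16

# Step 1: activation traces from concept-eliciting prompts
# (random stand-ins; in practice these come from forward hooks).
acts = rng.normal(size=(n_prompts, n_neurons))

# Toy linear readout whose output stands in for the unwanted behavior.
w = rng.normal(size=n_neurons)

# Step 2: per-neuron gradient saliency and a Fisher-style score.
# For this readout d(output)/d(act_j) = w_j, so the per-example
# gradient-times-input term is act_ij * w_j.
per_example = acts * w                      # broadcasts over neurons
saliency = np.abs(per_example).mean(axis=0)
fisher = (per_example ** 2).mean(axis=0)    # mean squared gradient term

# Step 3: mask the top-k neurons under the combined score.
k = 4
top = np.argsort(saliency * fisher)[-k:]
mask = np.zeros(n_neurons)
mask[top] = 1.0
assert int(mask.sum()) == k
```

The resulting binary mask is what constrains the masked-LoRA update in step 4.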
---

## Early Findings

### Figure 1 – Activation Sensitivity Map
<p align="center">
<img width="1114" height="575" src="https://github.com/user-attachments/assets/b052c312-b2b2-4b6a-bddd-d80df8c423fb" />
<br/>
<i>Saliency heatmap showing neurons highly correlated with the concept "Inception." These neurons form the basis of the masked-LoRA update.</i>
</p>

### Figure 2 – Before/After Unlearning Behavior
<p align="center">
<img width="1484" src="https://github.com/user-attachments/assets/a547f010-6be6-4f3a-9a40-0a2b7c033445" />
<br/>
<i>Baseline vs. unlearned model responses. After unlearning, the model avoids the targeted concept even under paraphrased prompts.</i>
</p>

### Figure 3 – Verification of Concept Removal
<p align="center">
<img width="1368" height="496" src="https://github.com/user-attachments/assets/5cf77eb6-2472-4428-865e-0ba08cc63e75" />
<br/>
<i>Before unlearning: the model correctly identifies and explains the movie "Inception."</i>
</p>

<p align="center">
<img width="1239" height="445" src="https://github.com/user-attachments/assets/6a47dd8a-12b1-495e-af4c-24c5168b5bba" />
<br/>
<i>After unlearning: the model fails direct probes, indicating suppression of the latent concept.</i>
</p>

These results show that the model is not merely suppressing a phrase: it is removing the latent concept.
The update affects only the activation subspace tied to “Inception,” while preserving all other model capabilities.

---
## Applications

Activation-guided masked-LoRA unlearning can be used in:

- Safety alignment
- Policy enforcement
- Copyright compliance
- Recommendation de-biasing
- Domain-specific reversible behavior modules

Adapters remain modular and do not alter the base model, making deployment safe for production systems.
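A toy illustration of that modularity, using the same row-masked low-rank update described in this README (all names and sizes here are hypothetical): the adapter is a separate object, so attaching or dropping it never mutates the base weights.

```python
import numpy as np

rng = np.random.default_rng(2)

class MaskedLoraAdapter:
    """Self-contained masked low-rank update; never mutates the base weight."""
    def __init__(self, d_out, d_in, r, masked_neurons):
        self.A = rng.normal(size=(r, d_in)) * 0.1
        self.B = rng.normal(size=(d_out, r)) * 0.1
        self.mask = np.zeros(d_out)
        self.mask[masked_neurons] = 1.0

    def delta(self):
        # Update rows only for the concept-linked neurons.
        return self.mask[:, None] * (self.B @ self.A)

W = rng.normal(size=(6, 8))                  # frozen base weight
W_snapshot = W.copy()
adapter = MaskedLoraAdapter(6, 8, r=2, masked_neurons=[1, 4])

x = rng.normal(size=8)
y_unlearned = (W + adapter.delta()) @ x      # adapter attached
y_base = W @ x                               # adapter dropped: base behavior returns

assert np.array_equal(W, W_snapshot)         # base weights never touched
# Unmasked neurons agree exactly between the two deployments.
assert np.allclose(y_unlearned[adapter.mask == 0], y_base[adapter.mask == 0])
```

Dropping the adapter fully restores base behavior, which is what makes the unlearning reversible and safe to roll back in production.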