rameyjm7 committed
Commit 3d31c12 · verified · 1 Parent(s): 3de3bed

Update README.md

Files changed (1):
  1. README.md +155 -59

README.md CHANGED
@@ -16,7 +16,6 @@ model-index:
  - name: Activation-Level Preference Unlearning
  results: []
  ---
-
  <p align="center">

  <a href="https://github.com/rameyjm7/llm-preference-unlearning">
@@ -36,122 +35,219 @@ model-index:

  </p>

- # Activation-Level Preference Unlearning
- ### Improving Robustness and Alignment in LLM-Based Recommender Systems

  ---

  ## Abstract

- This project investigates activation-level preference unlearning as a mechanism to improve robustness and alignment in large language model based recommender systems. Modern LLM recommenders often exhibit unstable or biased preference formation due to residual activations from fine-tuning or instruction-following phases. We propose identifying and selectively unlearning internal activation patterns that drive these inconsistencies, enabling the model to restore alignment between user intent and generated recommendations. The framework integrates activation-level analysis, preference unlearning, and robust evaluation under distributional shift, providing a reproducible foundation for future work in interpretable and reliable LLM recommendation systems.

  ---

  ## Motivation

- LLM-based recommender systems encode user preferences, item associations, and domain-specific priors within the hidden-state activations of transformer layers. While these models perform well in general recommendation tasks, they often develop undesirable behaviors:

- 1. Overly specific suggestions that contradict a user's stated intent.
- 2. Residual preferences from prior fine-tuning.
- 3. Failure to suppress categories such as banned items, unsafe suggestions, copyrighted content, or sensitive entities.
- 4. Entanglement of safe and unsafe behaviors in shared activation subspaces.

- Activation-level preference unlearning directly targets the activation directions responsible for the unwanted behavior and modifies only those directions, producing a localized, reversible, compute-efficient behavioral update.

  ---

- ## Preliminary Results

- LoRA proves highly effective in suppressing specific unwanted behavior (such as movie-title suggestions) while preserving overall model performance. Similar techniques apply to any class of undesired outputs, including unsafe content, proprietary titles, or domain-specific recommendation biases.

  <p align="center">
- <img width="920" height="431" src="https://github.com/user-attachments/assets/398800c7-dc3c-456c-a2af-296421056a71" />
  </p>

- These early results demonstrate:

- - The model suppresses targeted content without global degradation.
- - The unlearning generalizes across paraphrased prompts.
- - The intervention remains modular and non-destructive.
- - Qwen2.5-3B remains stable using minimal training compute.

  ---

- ## LoRA for Preference Unlearning

- Low-Rank Adaptation (LoRA) modifies model behavior using a small low-rank update that counteracts internal representations responsible for undesired outputs while freezing all pretrained weights.

- **Why LoRA is effective for unlearning:**

- - Pretrained weights remain unchanged.
- - Updates are localized and reversible.
- - Behavior generalizes semantically, not just lexically.
- - Supports deployment on low-power hardware.

  ---

- ## Activation-Guided Masked LoRA (AG-Masked-LoRA)

- Our approach extends LoRA using activation-guided masks derived from saliency probes and Fisher information. These neuron-level masks ensure the LoRA update applies only to the activation subspace associated with the undesired concept.

- Pipeline:

- 1. Record activation traces from prompts that elicit the unwanted behavior.
- 2. Identify sensitive neurons via gradient saliency and Fisher scoring.
- 3. Build masks isolating these high-impact neurons.
- 4. Train masked-LoRA adapters constrained to this subspace.
- 5. Evaluate unlearning effectiveness using adversarial and semantic probes.

  ---

- ## Early Findings

- ### Figure 1 – Activation Sensitivity Map
  <p align="center">
- <img width="1114" height="575" src="https://github.com/user-attachments/assets/b052c312-b2b2-4b6a-bddd-d80df8c423fb" />
- <br/>
- <i>Saliency heatmap showing neurons highly correlated with the concept "Inception." These neurons form the basis of the masked-LoRA update.</i>
  </p>

- ### Figure 2 – Before/After Unlearning Behavior
  <p align="center">
- <img width="1484" src="https://github.com/user-attachments/assets/a547f010-6be6-4f3a-9a40-0a2b7c033445" />
- <br/>
- <i>Baseline vs. unlearned model responses. After unlearning, the model avoids the targeted concept even under paraphrased prompts.</i>
  </p>

- ### Figure 3 – Verification of Concept Removal
  <p align="center">
- <img width="1368" height="496" src="https://github.com/user-attachments/assets/5cf77eb6-2472-4428-865e-0ba08cc63e75" />
- <br/>
- <i>Before unlearning: The model correctly identifies and explains the movie "Inception."</i>
  </p>

  <p align="center">
- <img width="1239" height="445" src="https://github.com/user-attachments/assets/6a47dd8a-12b1-495e-af4c-24c5168b5bba" />
- <br/>
- <i>After unlearning: The model fails direct probes, indicating suppression of the latent concept.</i>
  </p>

- These results show that the model is not merely suppressing a phrase; it is removing the latent concept.
- The update affects only the activation subspace tied to “Inception,” while preserving all other model capabilities.

  ---

- ## Applications

- Activation-guided masked-LoRA unlearning can be used in:

- - Safety alignment
  - Policy enforcement
- - Copyright compliance
- - Recommendation de-biasing
- - Domain-specific reversible behavior modules

- Adapters remain modular and do not alter the base model, making deployment safe for production systems.

  ---

- ## License

  MIT License.
 
  - name: Activation-Level Preference Unlearning
  results: []
  ---

  <p align="center">

  <a href="https://github.com/rameyjm7/llm-preference-unlearning">

  </p>

+ # Activation-Level Preference Unlearning (AG-Masked-LoRA)
+ ### Removing Latent Concepts While Preserving Global LLM Reasoning

  ---

  ## Abstract

+ Large Language Models (LLMs) increasingly power recommender systems, yet they often exhibit unstable or biased preference formation. Minor variations in prompt phrasing can activate different internal representations, leading to inconsistent or policy-violating outputs.
+
+ This project introduces **Activation-Guided Masked LoRA (AG-Masked-LoRA)**, a targeted unlearning method that identifies and suppresses the *activation subspace* responsible for an undesired concept—demonstrated here with **movie-title generation (“Inception”)**.
+
+ Our pipeline integrates:
+ - Activation probing
+ - Prompt perturbation stability analysis
+ - Gradient and saliency mapping
+ - Fisher information profiling
+ - Subspace-masked LoRA training
+ - Incremental concept-level unlearning
+
+ Results show that the model cleanly forgets the targeted concept while preserving reasoning, fluency, and instruction fidelity.

  ---

  ## Motivation

+ LLM-based recommendation and generation systems embed user intent, item associations, and implicit priors in high-dimensional activation pathways. While powerful, this creates challenges:

+ 1. Over-specific or incorrect recommendations due to activation drift.
+ 2. Entrenched behaviors from prior fine-tuning.
+ 3. Difficulty suppressing copyrighted, unsafe, or policy-restricted content.
+ 4. Entanglement of desirable and undesirable behaviors within shared neuron groups.

+ Understanding how specific prompts activate internal representations is critical both for trustworthy recommenders and for enterprise-grade safety alignment.
+
+ **Activation-guided unlearning** specifically addresses this: by identifying which neurons encode an unwanted concept and restricting LoRA updates to that region of latent space, we can *remove* a capability rather than merely filtering tokens.

  ---

+ # Phase 1 — Prompt Perturbation & Instability Analysis
+
+ Prompt variations intended to be semantically identical yield inconsistent movie-title recommendations, revealing instability in how Qwen2.5-3B processes preference queries.
+
+ <p align="center">
+ <img width="360" height="250" src="https://github.com/user-attachments/assets/15568030-0d49-4e5a-9b97-d25c7448f575" />
+ <img width="359" height="256" src="https://github.com/user-attachments/assets/69297c65-69fa-4fd7-9c8a-50c60226a2ed" />
+ </p>

+ **Figure 1.** Semantically equivalent prompts produce different responses, indicating latent-space sensitivity and inconsistent preference encoding.

  <p align="center">
+ <img width="899" height="351" src="https://github.com/user-attachments/assets/b5563a21-e855-4bdc-8e50-fa86ed869067" />
  </p>

+ **Figure 2.** Direct prompt-perturbation: phrasing changes alter the generated movie title, confirming activation-level instability.
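The instability above can be quantified as a simple consistency rate over paraphrase variants. A minimal sketch in pure Python; the `outputs` list is a hypothetical stand-in for the model generations shown in Figures 1 and 2, not actual project data:

```python
from collections import Counter

# Hypothetical generations for five semantically equivalent prompt variants.
outputs = ["Inception", "Interstellar", "Inception", "The Matrix", "Inception"]

counts = Counter(outputs)
top, freq = counts.most_common(1)[0]
consistency = freq / len(outputs)  # 1.0 would mean a perfectly stable preference

print(f"{top}: {consistency:.0%} consistent")
```

A consistency well below 1.0 across paraphrases is the behavioral symptom that the later activation-level analysis traces back to latent-space sensitivity.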
+
+ ---
+
+ # Phase 2 — Activation Probing, Saliency, and Gradient Sensitivity
+
+ We analyze how each transformer layer responds when the model attempts to generate a movie title.
+
+ ### Layerwise Gradient Sensitivity
+
+ <p align="center">
+ <img width="975" height="444" src="https://github.com/user-attachments/assets/ba76ef82-3a03-4a03-8ea1-6a840bf79bb2" />
+ </p>
+
+ **Figure 3.** Gradient sensitivity map showing which layers’ activations shift most strongly in response to movie-title prompting.
+
+ ### Saliency (Gradient × Activation)
+
+ <p align="center">
+ <img width="975" height="498" src="https://github.com/user-attachments/assets/eea9dd69-1827-4545-bed8-1a9aa522f43f" />
+ </p>
+
+ **Figure 4.** Saliency heatmap identifying layers whose neurons strongly encode the movie-title concept.
+
+ ### Combined Sensitivity Analysis
+
+ <p align="center">
+ <img width="975" height="592" src="https://github.com/user-attachments/assets/62fb3c65-beb1-4995-8534-5eb645521956" />
+ </p>

+ **Figure 5.** Layerwise correlation of saliency, Fisher information, and activation similarity identifies a consistent high-impact region in mid-model layers.
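The gradient × activation saliency score can be sketched as follows. This is a toy numpy illustration, not the repository's probe code; `acts` and `grads` are hypothetical stand-ins for one layer's activations and gradients, which in practice are recorded via forward/backward hooks:

```python
import numpy as np

# Toy stand-ins for one layer's recorded activations and gradients
# (in practice these come from forward/backward hooks on the model).
acts = np.array([[0.9, 0.1, 1.2, 0.0],
                 [1.1, 0.2, 0.8, 0.1]])   # 2 prompts x 4 neurons
grads = np.array([[0.5, 0.4, 0.6, 0.9],
                  [0.7, 0.3, 0.5, 0.8]])  # dLoss/dActivation

# Saliency = |gradient * activation|, averaged over probe prompts.
saliency = np.abs(acts * grads).mean(axis=0)

# Rank neurons: high saliency = strongly tied to the probed concept.
ranked = np.argsort(saliency)[::-1]
print(ranked)  # most salient neuron first
```

Neurons at the top of this ranking are the candidates carried forward into the masking step.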
  ---

+ # Phase 3 — Semantic Similarity vs Activation Structure

+ We measure whether semantic similarity across prompts matches activation-level similarity.

+ <p align="center">
+ <img width="975" height="797" src="https://github.com/user-attachments/assets/1de1b763-c637-4d2d-b9f6-d0afcb002748" />
+ </p>

+ **Figure 6.** Semantic similarity (top) vs activation overlap (bottom).
+ Prompts that *mean the same thing* do *not* necessarily activate the same neurons—revealing a root cause of preference drift.
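The comparison in Figure 6 amounts to computing two similarity measures per prompt pair. A minimal sketch with hypothetical vectors standing in for sentence embeddings and hidden-state activations (not the project's actual representations):

```python
import numpy as np

def cosine(u, v):
    # Standard cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical stand-ins for two paraphrases of the same request.
emb_a, emb_b = np.array([1.0, 0.9, 0.1]), np.array([0.9, 1.0, 0.2])  # embeddings
act_a, act_b = np.array([1.0, 0.0, 0.8]), np.array([0.1, 1.0, 0.0])  # activations

sem_sim = cosine(emb_a, emb_b)  # high: the prompts mean the same thing
act_sim = cosine(act_a, act_b)  # can still be low: different neurons fire

print(f"semantic={sem_sim:.2f} activation={act_sim:.2f}")
```

A large gap between the two values for a paraphrase pair is the signature of preference drift described above.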
  ---

+ # Phase 4 — Fisher Information Profiling

+ <p align="center">
+ <img width="975" height="403" src="https://github.com/user-attachments/assets/96306821-75d8-49c8-9d5f-26906a6d48e1" />
+ </p>

+ **Figure 7.** Mean gradient norm per layer, pinpointing where the model is most sensitive.
+
+ <p align="center">
+ <img width="975" height="511" src="https://github.com/user-attachments/assets/e05e5312-1a11-4f1e-b199-1ef3574504a8" />
+ </p>

+ **Figure 8.** Fisher information heatmap showing which neurons maintain the highest influence on movie-title generation.
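The diagonal Fisher information behind Figures 7 and 8 is commonly estimated as the mean squared gradient over probe prompts. A toy sketch with hypothetical gradient values (the percentile threshold is an illustrative choice, not the project's):

```python
import numpy as np

# Per-prompt gradients for one layer's neurons (toy stand-ins; in practice
# collected by backpropagating the loss on concept-eliciting prompts).
grads = np.array([[ 0.8, -0.1, 0.6],
                  [ 1.0,  0.0, 0.5],
                  [-0.9,  0.1, 0.7]])  # 3 prompts x 3 neurons

# Empirical diagonal Fisher information: mean of squared gradients.
fisher = (grads ** 2).mean(axis=0)

# Neurons above a percentile threshold are flagged for the unlearning mask.
threshold = np.percentile(fisher, 66)
mask = fisher >= threshold
print(fisher, mask)
```

Agreement between this Fisher ranking and the saliency ranking from Phase 2 is what justifies masking the same neuron cluster.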
  ---

+ # Phase 5 — Activation-Guided Masked LoRA (AG-Masked-LoRA)
+
+ A low-rank update is selectively applied *only* to neurons identified as encoding the targeted concept.
+
+ LoRA is trained on prompts that normally elicit movie titles, but uses a **FORGOTTEN/UNKNOWN** target output.
+ The update is masked to affect only sensitive neurons, leaving the rest of the model untouched.

  <p align="center">
+ <img width="975" height="61" src="https://github.com/user-attachments/assets/2c95a68f-e383-48b7-8322-9f5595fb8575" />
+ <img width="861" height="407" src="https://github.com/user-attachments/assets/6de06210-da85-4958-986e-5085c5bd5a93" />
  </p>

+ **Figure 9.** Incremental unlearning logs showing loss reduction while applying masked LoRA updates.
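The masking step can be sketched as constraining the low-rank update to selected output neurons. A minimal numpy illustration under assumed shapes; the mask indices are hypothetical and the real adapters are trained against the FORGOTTEN/UNKNOWN target, not drawn at random:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2                        # hidden size, LoRA rank (toy scale)

W = rng.normal(size=(d, d))        # frozen pretrained weight (stand-in)
A = rng.normal(size=(r, d)) * 0.1  # LoRA down-projection
B = rng.normal(size=(d, r)) * 0.1  # LoRA up-projection

# Neuron mask from the saliency/Fisher analysis: here only output
# neurons 1 and 4 are tied to the unwanted concept (hypothetical choice).
mask = np.zeros((d, 1))
mask[[1, 4]] = 1.0

# Masked LoRA update: the delta is zeroed outside the sensitive subspace,
# so the effective weight differs from W only on the masked rows.
delta = mask * (B @ A)
W_eff = W + delta

untouched = [i for i in range(d) if not np.any(delta[i])]
print(untouched)  # rows of W left exactly as pretrained
```

Because `W` itself is never modified, removing the adapter restores the original behavior, which is what makes the intervention reversible.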
+ ---
+
+ # Phase 6 — Evaluation: Before/After Unlearning
+
+ ### Base Model (Before/After)
+
  <p align="center">
+ <img width="975" height="456" src="https://github.com/user-attachments/assets/94f51feb-b1f2-4f48-b6d2-3d6a70a72205" />
  </p>

+ ### Unlearned Model (Before/After)
+
  <p align="center">
+ <img width="975" height="219" src="https://github.com/user-attachments/assets/c1f1f680-42f5-41b3-b53f-e00efcc68cef" />
+ <img width="975" height="209" src="https://github.com/user-attachments/assets/d9242c27-3ccc-490d-a18e-4439424d2911" />
  </p>

+ **Figure 10–11.** The unlearned model consistently returns FORGOTTEN/UNKNOWN across paraphrased prompts.
+
+ ### Direct Concept Probing (Before/After)
+
  <p align="center">
+ <img width="975" height="238" src="https://github.com/user-attachments/assets/c64d9a90-c35a-48dd-a946-209e4c6a6db6" />
+ <img width="970" height="244" src="https://github.com/user-attachments/assets/789d64a7-1361-4d0f-9d41-23259c920376" />
  </p>

+ **Figure 12.** Even when asked explicitly about “Inception,” the model no longer retrieves or describes it, indicating true concept removal.
  ---

+ # Final Findings
+
+ Our experiments confirm that AG-Masked-LoRA performs **structural semantic unlearning**, not superficial keyword suppression.

+ ### Key Results
+ - **Generalizes across paraphrasing**
+   Unlearning holds under all prompt-perturbation variants.
+ - **Consistent neuron clusters identified**
+   Saliency + Fisher converge on the same mid-model layers.
+ - **Clear activation shift**
+   PCA and activation distance show pre/post separation.
+ - **Global reasoning preserved**
+   No degradation in unrelated tasks or instruction following.
+ - **Deployment ready**
+   Runs cleanly on A100, L4, and Jetson Orin.

+ ### Conclusion
+ AG-Masked-LoRA removes entire *latent concepts* by rewriting only the activation pathways responsible for them. This makes it suitable for:
+
+ - Safety-critical filtering
  - Policy enforcement
+ - Copyright-restricted retrieval removal
+ - Reversible domain-specific behavior modules

+ The base model remains unmodified—only a small adapter controls the behavior.
  ---

+ # Project Resources and Repositories
+
+ ### GitHub — Full Source Code
+ **LLM Preference Unlearning (Activation-Level Framework)**
+ https://github.com/rameyjm7/llm-preference-unlearning

+ Includes:
+ - Modular notebooks (00–08)
+ - Unified pipeline notebook
+ - Activation probe scripts
+ - Saliency, gradient, and Fisher analysis
+ - Incremental unlearning engine
+ - Figures and logs
+
+ ### HuggingFace — Model Card & Artifacts
+ **Activation-Level Preference Unlearning (HF)**
+ https://huggingface.co/rameyjm7/llm-preference-unlearning
+
+ Includes:
+ - Model card
+ - Figures and evaluation
+ - Adapter artifacts (optional)
+ - Notebook links
+
+ ---
+
+ ## License
  MIT License.