---
license: mit
---
# 🌟 Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
# NeurIPS 2025 (Rating: 4445)
> [Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation](https://arxiv.org/pdf/2505.14705).<br>
> [Xin Zhang](https://zhangxin-xd.github.io/), Ziruo Zhang, [Jiawei Du](https://scholar.google.com/citations?user=WrJKEzEAAAAJ&hl=zh-CN), [Zuozhu Liu](https://person.zju.edu.cn/en/lzz), [Joey Tianyi Zhou](https://joeyzhouty.github.io/) <br>
> Agency for Science, Technology and Research (A*STAR), Singapore <br>
> National University of Singapore, Singapore <br>
> Zhejiang University, China
## 📖 Introduction
<p align="center">
  <img src="imgs/problem.png" alt="problem" title="problem" width="700">
</p>

<p align="justify">
<strong>Multimodal embedding distributions across distillation methods</strong>:
We extract image and text embeddings from a fine-tuned CLIP model and project them into a shared representation space using DOSNES.
Red triangles and blue circles denote image and text embeddings, respectively.
Left: Embeddings of randomly sampled data from the original dataset exhibit a well-spread, modality-aligned distribution.
Middle: The distilled dataset generated by a state-of-the-art multimodal dataset distillation (MDD) method, LoRS, suffers from modality collapse: image and text embeddings are poorly aligned and concentrated in distinct regions.
Right: Our method effectively mitigates modality collapse, yielding a distribution that better preserves cross-modal alignment and exhibits greater representational diversity.
</p>

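For readers who want to reproduce this kind of visualization, the sketch below extracts paired CLIP embeddings and projects them into a shared 2-D space. The checkpoint name, the image files, and the use of t-SNE in place of DOSNES are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: extract CLIP image/text embeddings and project them to 2-D.
# The checkpoint name, file names, and t-SNE (standing in for DOSNES) are
# assumptions for illustration only.
import torch
from PIL import Image
from sklearn.manifold import TSNE
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["cat.jpg", "dog.jpg"]]  # hypothetical files
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])

# L2-normalize so both modalities live on the same unit sphere, then embed
# images and texts jointly so cross-modal alignment is visible in the plot.
emb = torch.cat([img, txt])
emb = (emb / emb.norm(dim=-1, keepdim=True)).numpy()
coords = TSNE(n_components=2, perplexity=2.0).fit_transform(emb)  # rows 0-1: images, 2-3: texts
```
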
## ⚙️ Installation

To get started, follow these instructions to set up the environment and install dependencies.

1. **Clone this repository**:
   ```bash
   git clone https://github.com/zhangxin-xd/RepBlend.git
   cd RepBlend
   ```

2. **Install required packages**:
   ```bash
   conda create -n RepBlend python=3.10
   conda activate RepBlend
   pip install -r requirements.txt
   ```
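
After installing, a quick check that the core dependency is importable and sees your GPU; this assumes PyTorch is among the packages in `requirements.txt`, which we have not verified here.

```python
# Quick environment sanity check; assumes PyTorch is among the requirements.
import torch

print(torch.__version__, "CUDA available:", torch.cuda.is_available())
```
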
---

## 🚀 Usage

Here’s how to use RepBlend for multimodal dataset distillation:

### Pretrained Weights
The checkpoints for all networks used in our experiments are available from their respective official repositories. For convenience, we have also collected them in one place [here](https://huggingface.co/xinxin66/RepBlend).
Once downloaded, place them in `distill_utils/checkpoints/`.

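If you prefer scripting the download, something like the following works with `huggingface_hub`; the assumption that the Hub repo's file layout maps directly onto `distill_utils/checkpoints/` is ours, so adjust paths if needed.

```python
# Fetch the released files from the Hugging Face Hub into the expected folder.
# Assumes the repo layout matches distill_utils/checkpoints/; adjust if not.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="xinxin66/RepBlend", local_dir="distill_utils/checkpoints")
```
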
### Experimental Datasets
Our method has been validated on several benchmark datasets; you can download them from the links below. Once downloaded, place them in `distill_utils/data/`.

| Dataset | Links |
|-----|-----|
| Flickr30K | [images](https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset), [annotations](https://huggingface.co/xinxin66/RepBlend/) |
| COCO | [images](https://cocodataset.org/#download), [annotations](https://huggingface.co/xinxin66/RepBlend) |
| LLaVA-cc3m | [images](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md), [annotations](https://huggingface.co/xinxin66/RepBlend) |

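As a convenience, here is a small check that the data landed where the scripts expect it; the subdirectory names below are our guesses, so rename them to match your local layout.

```python
# Sanity-check the expected data root; the folder names are assumptions.
from pathlib import Path

data_root = Path("distill_utils/data")
for name in ["Flickr30K", "COCO", "LLaVA-cc3m"]:  # hypothetical folder names
    print(f"{name}: {'ok' if (data_root / name).exists() else 'missing'}")
```
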
### Generate Expert Trajectories
You can generate expert trajectories by running `scripts/buffer.sh`, or download our [pre-generated trajectories](https://huggingface.co/xinxin66/RepBlend) for faster reproduction.
```bash
bash scripts/buffer.sh
```
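
For intuition: in trajectory-matching pipelines such as TESLA (referenced below), the buffer stores per-epoch parameter snapshots of teacher networks trained on the real data. The sketch below is a generic schematic of that idea with placeholder names (`model`, `loader`, a classification loss), not the actual logic of `scripts/buffer.sh`.

```python
# Generic schematic of expert-trajectory buffering (cf. TESLA/MTT-style
# methods); model, loader, loss, and hyperparameters are placeholders and
# may differ from what scripts/buffer.sh actually does.
import copy
import torch
import torch.nn.functional as F

def record_trajectory(model, loader, epochs=10, lr=0.01):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    trajectory = [copy.deepcopy(model.state_dict())]  # snapshot at initialization
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
        trajectory.append(copy.deepcopy(model.state_dict()))  # per-epoch snapshot
    return trajectory

# torch.save(record_trajectory(model, loader), "buffers/expert_0.pt")
```
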
### Distill Multimodal Dataset
You can distill multimodal datasets with RepBlend by running `scripts/distill_coco_repblend.sh` and `scripts/distill_flickr_repblend.sh`.
```bash
bash scripts/distill_coco_repblend.sh
bash scripts/distill_flickr_repblend.sh
```
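
For intuition about what these scripts optimize: in this family of methods, a student is trained for a few steps on the synthetic data, and its endpoint parameters are pulled toward a later point on an expert trajectory. Below is a hedged sketch of the standard normalized matching loss (as in MTT/TESLA); RepBlend's representation-blending objective builds on top of this and is not shown.

```python
import torch

def trajectory_matching_loss(student, expert_start, expert_target):
    """Normalized parameter-matching loss from MTT-style distillation.

    All arguments are lists of parameter tensors. This is the generic
    objective, not RepBlend's full loss.
    """
    num = sum(((s - t) ** 2).sum() for s, t in zip(student, expert_target))
    den = sum(((a - t) ** 2).sum() for a, t in zip(expert_start, expert_target))
    return num / (den + 1e-12)
```
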

## 📊 Results

Our experiments demonstrate the effectiveness of the proposed approach across various benchmarks.
<div style="display: flex; justify-content: center; align-items: center;">
  <img src="imgs/results 1.png" alt="Results 1" width="800"/>
</div>
<br>
<div style="display: flex; justify-content: center; align-items: center;">
  <img src="imgs/table 1.png" alt="table 1" width="400"/>
  <img src="imgs/table 2.png" alt="table 2" width="400"/>
</div>

For detailed experimental results and further analysis, please refer to the full paper.

---

## 📑 Citation

If you find this code useful in your research, please consider citing our work:

```bibtex
@inproceedings{RepBlend2025neurips,
  title={Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation},
  author={Zhang, Xin and Zhang, Ziruo and Du, Jiawei and Liu, Zuozhu and Zhou, Joey Tianyi},
  booktitle={Adv. Neural Inf. Process. Syst. (NeurIPS)},
  year={2025}
}
```
---
## 🎉 Reference
Our code builds on the following prior works:
- [LoRS: Low-Rank Similarity Mining](https://github.com/silicx/LoRS_Distill)
- [Vision-Language Dataset Distillation](https://github.com/princetonvisualai/multimodal_dataset_distillation)
- [Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory (TESLA)](https://github.com/justincui03/tesla)