---
license: mit
---
# Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
# NeurIPS 2025 (Rating: 4445)
> [Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation](https://arxiv.org/pdf/2505.14705).<br>
> [Xin Zhang](https://zhangxin-xd.github.io/), Ziruo Zhang, [Jiawei Du](https://scholar.google.com/citations?user=WrJKEzEAAAAJ&hl=zh-CN), [Zuozhu Liu](https://person.zju.edu.cn/en/lzz), [Joey Tianyi Zhou](https://joeyzhouty.github.io/) <br>
> Agency for Science, Technology and Research (A\*STAR), Singapore <br>
> National University of Singapore, Singapore <br>
> Zhejiang University, China

## Introduction
<p align="center">
<img src="imgs/problem.png" alt="problem" title="problem" width="700">
</p>

<p align="justify">
<strong>Multimodal embedding distributions across distillation methods</strong>:
We extract image and text embeddings from a fine-tuned CLIP model and project them into a shared representation space using DOSNES.
Red triangles and blue circles denote image and text embeddings, respectively.
Left: Embeddings from randomly sampled data in the original dataset exhibit a well-spread, modality-aligned distribution.
Middle: The distilled dataset generated by a state-of-the-art multimodal dataset distillation (MDD) method, LoRS, suffers from modality collapse: image and text embeddings are poorly aligned and concentrated in distinct regions.
Right: Our method effectively mitigates modality collapse, yielding a distribution that better preserves cross-modal alignment and exhibits greater representational diversity.
</p>

## Installation

To get started, follow these instructions to set up the environment and install dependencies.

1. **Clone this repository**:
```bash
git clone https://github.com/zhangxin-xd/RepBlend.git
cd RepBlend
```

2. **Install required packages**:
```bash
conda create -n RepBlend python=3.10
conda activate RepBlend
pip install -r requirements.txt
```
---

## Usage

Here's how to use RepBlend for Multimodal Dataset Distillation:

### Pretrained Weights
The checkpoints for all experimental networks are available from their respective official repositories. For convenience, we have also provided them together [here](https://huggingface.co/xinxin66/RepBlend).
Once downloaded, put them in `distill_utils/checkpoints/`.
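If you use the Hugging Face CLI, the sketch below is one convenient way to fetch everything at once; it assumes the `huggingface_hub` package is installed and simply mirrors the whole repository into the folder the scripts expect.

```bash
# Sketch: mirror the RepBlend weight repository into the expected folder.
# Assumes huggingface_hub is installed (pip install -U huggingface_hub).
huggingface-cli download xinxin66/RepBlend --local-dir distill_utils/checkpoints
```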

### Experimental Datasets
Our method has been validated on various benchmark datasets, which you can download from the links below. Once downloaded, put them in `distill_utils/data/`.

| Datasets | Links |
|-----|-----|
| Flickr30K | [images](https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset), [annotations](https://huggingface.co/xinxin66/RepBlend/) |
| COCO | [images](https://cocodataset.org/#download), [annotations](https://huggingface.co/xinxin66/RepBlend) |
| LLaVA-cc3m | [images](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md), [annotations](https://huggingface.co/xinxin66/RepBlend) |
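After downloading, a plausible layout is sketched below; the exact subdirectory names are our assumption for illustration, so defer to the paths referenced in the distillation scripts if they differ.

```
distill_utils/data/
├── Flickr30K/    # images + annotations (assumed folder name)
├── COCO/         # images + annotations (assumed folder name)
└── LLaVA-cc3m/   # images + annotations (assumed folder name)
```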
### Generate Expert Trajectories
You can generate expert trajectories by running `scripts/buffer.sh`, or alternatively download our [pre-generated trajectories](https://huggingface.co/xinxin66/RepBlend) for faster reproduction.
```bash
bash scripts/buffer.sh
```
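If you opt for the pre-generated trajectories instead, the Hugging Face CLI can filter the download; note that the `buffers/*` pattern below is only our guess at the repository layout, so adjust it to the actual folder name.

```bash
# Sketch: pull only the expert trajectories instead of regenerating them.
# The "buffers/*" include pattern is an assumed path; adjust as needed.
huggingface-cli download xinxin66/RepBlend --include "buffers/*" --local-dir .
```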
### Distill Multimodal Dataset
You can distill multimodal datasets with RepBlend by running `scripts/distill_coco_repblend.sh` and `scripts/distill_flickr_repblend.sh`.
```bash
bash scripts/distill_coco_repblend.sh
bash scripts/distill_flickr_repblend.sh
```

## Results

Our experiments demonstrate the effectiveness of the proposed approach across various benchmarks.
<div style="display: flex; justify-content: center; align-items: center;">
<img src="imgs/results 1.png" alt="Results 1" width="800"/>
</div>
<br>
<div style="display: flex; justify-content: center; align-items: center;">
<img src="imgs/table 1.png" alt="table 1" width="400"/>
<img src="imgs/table 2.png" alt="table 2" width="400"/>
</div>

For detailed experimental results and further analysis, please refer to the full paper.

---

## Citation

If you find this code useful in your research, please consider citing our work:

```bibtex
@inproceedings{RepBlend2025neurips,
  title={Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation},
  author={Zhang, Xin and Zhang, Ziruo and Du, Jiawei and Liu, Zuozhu and Zhou, Joey Tianyi},
  booktitle={Adv. Neural Inf. Process. Syst. (NeurIPS)},
  year={2025}
}
```
---
## Reference
Our code builds on the following prior works:
- [LoRS: Low-Rank Similarity Mining](https://github.com/silicx/LoRS_Distill)
- [Vision-Language Dataset Distillation](https://github.com/princetonvisualai/multimodal_dataset_distillation)
- [Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory (TESLA)](https://github.com/justincui03/tesla)