Improve model card: Add metadata, structured paper link, project page, and code links

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +25 -1
README.md CHANGED
@@ -1,4 +1,28 @@
---
license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
---
- arxiv.org/abs/2512.03794
+
+ # AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
+
+ AdaptVision is an efficient Vision-Language Model (VLM) paradigm designed to achieve adaptive visual token acquisition through a coarse-to-fine approach. Inspired by human active vision mechanisms, this model addresses the significant computational overhead in VLMs by autonomously determining the minimum number of visual tokens required for each sample. It selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary.
+
+ The model was presented in the paper:
+ [AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition](https://arxiv.org/abs/2512.03794)
+
+ For more details, please visit the [project page](https://adaptvision.github.io/).
+ The official code can be found in the [GitHub repository](https://github.com/AdaptVision/AdaptVision).
+
+ ## Citation
+
+ If you find this project useful in your research, please consider citing:
+
+ ```bibtex
+ @article{lin2025adapt,
+   title={AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition},
+   author={Zichuan Lin and Yicheng Liu and Yang Yang and Lvfang Tao and Deheng Ye},
+   journal={arXiv preprint arXiv:2512.03794},
+   year={2025}
+ }
+ ```
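
Since this PR declares `library_name: transformers` and `pipeline_tag: image-text-to-text`, the card could also carry a short usage snippet. The sketch below is illustrative only and is not part of the diff above: the repo id, the `AutoProcessor`/`AutoModelForVision2Seq` classes, and the prompt are assumptions about how such a checkpoint would typically be loaded, not details confirmed by the model card.

```python
# Hypothetical usage sketch (not part of this PR). Assumes the checkpoint is
# published under an id like "AdaptVision/AdaptVision" and loads with the
# generic vision-to-text auto classes from transformers.
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests

model_id = "AdaptVision/AdaptVision"  # assumed repo id; replace with the real checkpoint

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# Any RGB image works; this URL is a placeholder.
image = Image.open(
    requests.get("https://example.com/sample.jpg", stream=True).raw
).convert("RGB")

# Encode the image together with a text prompt, then generate an answer.
inputs = processor(images=image, text="Describe this image.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```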