OneThinker: All-in-one Reasoning Model for Image and Video
This repository contains the SFT model presented in: OneThinker: All-in-one Reasoning Model for Image and Video
This is an intermediate model prepared for subsequent RL training.
For more detailed instructions on environment setup, training scripts, and comprehensive evaluation, please refer to the OneThinker GitHub repository.
About OneThinker
We introduce OneThinker, an all-in-one multimodal reasoning generalist capable of thinking across a wide range of fundamental visual tasks within a single model.
OneThinker unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the large-scale OneThinker-600k multi-task training corpus and build OneThinker-SFT-340k with high-quality CoT annotations for SFT cold start. Furthermore, we propose EMA-GRPO, a new RL method that balances heterogeneous reward signals across diverse visual tasks by tracking task-wise moving averages of reward standard deviations for balanced optimization.
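To make the EMA-GRPO idea concrete, here is a minimal, hypothetical sketch of balancing heterogeneous rewards by tracking a task-wise exponential moving average (EMA) of reward standard deviations and scaling group-centered advantages by it. The class and parameter names (`EMARewardNormalizer`, `decay`, `eps`) are illustrative assumptions, not the paper's actual implementation; see the OneThinker repository for the real training code.

```python
class EMARewardNormalizer:
    """Tracks a per-task EMA of reward standard deviations and uses it to
    scale advantages, so tasks with noisier reward signals do not dominate
    the RL update. Hypothetical sketch of the EMA-GRPO balancing idea."""

    def __init__(self, decay=0.99, eps=1e-6):
        self.decay = decay          # EMA smoothing factor (assumed value)
        self.eps = eps              # avoids division by zero
        self.ema_std = {}           # task name -> EMA of reward std

    def normalize(self, task, rewards):
        # Statistics over one GRPO-style group of rollouts for this task.
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        std = var ** 0.5

        # Update the task-wise moving average of the reward std.
        prev = self.ema_std.get(task)
        self.ema_std[task] = std if prev is None else (
            self.decay * prev + (1 - self.decay) * std
        )

        # Advantage: group-centered reward scaled by the task-wise EMA std,
        # so every task contributes at a comparable magnitude.
        scale = self.ema_std[task] + self.eps
        return [(r - mean) / scale for r in rewards]
```

Normalizing by a smoothed per-task statistic, rather than the per-group std alone, keeps the advantage scale stable across tasks whose reward distributions differ widely (e.g. binary QA accuracy vs. continuous IoU for grounding).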
OneThinker demonstrates strong performance on 31 benchmarks across 10 fundamental vision tasks, while showing effective knowledge transfer between certain tasks and promising zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist.
Citations
If you find our work helpful for your research, please consider citing it:
@article{feng2025onethinker,
  title={OneThinker: All-in-one Reasoning Model for Image and Video},
  author={Feng, Kaituo and Zhang, Manyuan and Li, Hongyu and Fan, Kaixuan and Chen, Shuang and Jiang, Yilei and Zheng, Dian and Sun, Peiwen and Zhang, Yiyuan and Sun, Haoze and others},
  journal={arXiv preprint arXiv:2512.03043},
  year={2025}
}
Model: OneThink/OneThinker-SFT-Qwen3-8B
Base model: Qwen/Qwen3-VL-8B-Instruct