OneThinker: All-in-one Reasoning Model for Image and Video

This repository contains the SFT model presented in: OneThinker: All-in-one Reasoning Model for Image and Video

This is an intermediate model prepared for subsequent RL training.

For more detailed instructions on environment setup, training scripts, and comprehensive evaluation, please refer to the OneThinker GitHub repository.

👀 About OneThinker

We introduce OneThinker, an all-in-one multimodal reasoning generalist that is capable of thinking across a wide range of fundamental visual tasks within a single model.

OneThinker unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the large-scale OneThinker-600k multi-task training corpus and build OneThinker-SFT-340k with high-quality CoT annotations for SFT cold start. Furthermore, we propose EMA-GRPO, a new RL method that handles heterogeneous reward signals across diverse visual tasks by tracking task-wise moving averages of reward standard deviations, so that optimization stays balanced across tasks.
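To make the idea of task-wise moving averages of reward standard deviations more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the class name `EmaStdTracker`, the `decay` and `eps` values, and the choice to subtract the group mean and divide by the EMA of the standard deviation are all assumptions made for illustration, loosely following the usual GRPO-style group normalization.

```python
from statistics import mean, pstdev


class EmaStdTracker:
    """Hypothetical helper: keeps one exponential moving average (EMA)
    of the reward standard deviation per task."""

    def __init__(self, decay=0.99, eps=1e-6):
        self.decay = decay    # assumed smoothing factor, not from the paper
        self.eps = eps        # avoids division by zero
        self.ema_std = {}     # task name -> EMA of reward std

    def update(self, task, rewards):
        """Fold the std of one group of rewards into the task's EMA."""
        batch_std = pstdev(rewards)
        if task not in self.ema_std:
            # First observation for this task initialises the average.
            self.ema_std[task] = batch_std
        else:
            self.ema_std[task] = (
                self.decay * self.ema_std[task]
                + (1.0 - self.decay) * batch_std
            )
        return self.ema_std[task]

    def advantages(self, task, rewards):
        """Centre rewards on the group mean and scale by the task's EMA std."""
        scale = self.update(task, rewards) + self.eps
        baseline = mean(rewards)
        return [(r - baseline) / scale for r in rewards]


tracker = EmaStdTracker(decay=0.9)
# A task with binary rewards and a task with dense, narrow rewards
# end up with advantages on a comparable scale.
qa_adv = tracker.advantages("qa", [0.0, 1.0, 1.0, 0.0])
grounding_adv = tracker.advantages("grounding", [0.40, 0.45, 0.50, 0.55])
```

Under these assumptions, each task is normalized by its own smoothed reward spread rather than by a single global statistic, which is one plausible way a sparse 0/1 reward and a dense IoU-style reward can contribute gradients of similar magnitude.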

OneThinker demonstrates strong performance on 31 benchmarks across 10 fundamental vision tasks, while showing effective knowledge transfer between certain tasks and promising zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist.

📄 Citations

If you find our work helpful for your research, please consider citing it.

@article{feng2025onethinker,
  title={OneThinker: All-in-one Reasoning Model for Image and Video},
  author={Feng, Kaituo and Zhang, Manyuan and Li, Hongyu and Fan, Kaixuan and Chen, Shuang and Jiang, Yilei and Zheng, Dian and Sun, Peiwen and Zhang, Yiyuan and Sun, Haoze and others},
  journal={arXiv preprint arXiv:2512.03043},
  year={2025}
}


Model tree for OneThink/OneThinker-SFT-Qwen3-8B

Base model: Qwen/Qwen3-VL-8B-Instruct
