InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
Abstract
O3-Bench evaluates multimodal reasoning that demands interleaved attention to visual details, while InSight-o3 improves performance with a multi-agent framework that pairs a visual reasoning agent with a specialized visual search agent.
The ability of AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning side, which is crucial for real-world tasks like analyzing documents dense with charts and diagrams or navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image regions through multi-step reasoning. These problems are difficult even for frontier systems like OpenAI o3, which reaches only 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher), for which we introduce the task of generalized visual search: locating relational, fuzzy, or conceptual regions described in free-form language, beyond simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (serving as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step toward powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3.
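To make the division of labor concrete, below is a minimal sketch of how a plug-and-play vSearcher could be wired into a vReasoner's loop. The abstract does not specify the actual interfaces, so every name here (`VSearcher`, `search`, `Region`, `next_step`, `best_guess`) is an illustrative assumption: the reasoner drafts a free-form region query (possibly relational or conceptual, e.g. "the legend entry matching the tallest bar"), the search agent returns candidate crops, and the crops are appended to the reasoner's context as new visual evidence.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of an InSight-o3-style reasoning/search loop.
# The paper's real interfaces are not given in the abstract; all
# class, method, and field names below are assumptions.

@dataclass
class Region:
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in image pixels
    crop: bytes                      # encoded crop of the located region
    score: float                     # searcher's confidence in the match

class VSearcher:
    """Generalized visual search: map a free-form textual query
    (relational, fuzzy, or conceptual) to candidate image regions."""

    def search(self, image: bytes, query: str, top_k: int = 3) -> List[Region]:
        # In the paper this would be backed by a multimodal LLM
        # purpose-trained for the task via reinforcement learning.
        raise NotImplementedError

def reason_with_search(vreasoner, vsearcher: VSearcher,
                       image: bytes, question: str,
                       max_steps: int = 8) -> str:
    """Interleave free-form reasoning with delegated visual search."""
    context = [("question", question)]
    for _ in range(max_steps):
        # The reasoner either emits a final answer or a region query.
        step = vreasoner.next_step(image, context)  # assumed interface
        if step.kind == "answer":
            return step.text
        # Delegate fine-grained localization to the search agent and
        # feed the returned crops back as new visual evidence.
        regions = vsearcher.search(image, step.text)
        context.append(("evidence", [r.crop for r in regions]))
    return vreasoner.best_guess(context)
```

One appeal of this decoupling is that the same search agent can sit behind any frontier multimodal model acting as the reasoner, which matches how the abstract describes vSearcher improving frontier models across a range of benchmarks.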
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Monet: Reasoning in Latent Visual Space Beyond Images and Language (2025)
- Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning (2025)
- GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models (2025)
- Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images (2025)
- MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models (2025)
- DeepEyesV2: Toward Agentic Multimodal Model (2025)
- TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning (2025)