---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: transformers
pipeline_tag: image-to-text
tags:
- multimodal
- video-understanding
- spatial-reasoning
- vision-language
datasets:
- nyu-visionx/VSI-590K
model-index:
- name: Cambrian-S-7B
  results:
  - task:
      type: visual-question-answering
      name: VSI-Bench
    dataset:
      type: vsi-bench
      name: VSI-Bench
    metrics:
    - type: accuracy
      name: accuracy
      value: 67.5
  - task:
      type: visual-question-answering
      name: Tomato
    dataset:
      type: Tomato
      name: Tomato
    metrics:
    - type: accuracy
      name: accuracy
      value: 27
  - task:
      type: visual-question-answering
      name: HourVideo
    dataset:
      type: hourvideo
      name: HourVideo
    metrics:
    - type: accuracy
      name: accuracy
      value: 36.5
  - task:
      type: visual-question-answering
      name: EgoSchema
    dataset:
      type: egoschema
      name: EgoSchema
    metrics:
    - type: accuracy
      name: accuracy
      value: 76.8
  - task:
      type: visual-question-answering
      name: Perception Test
    dataset:
      type: perception-test
      name: Perception Test
    metrics:
    - type: accuracy
      name: accuracy
      value: 69.9
  - task:
      type: visual-question-answering
      name: VideoMME
    dataset:
      type: videomme
      name: VideoMME
    metrics:
    - type: accuracy
      name: accuracy
      value: 63.4
  - task:
      type: visual-question-answering
      name: MVBench
    dataset:
      type: mvbench
      name: MVBench
    metrics:
    - type: accuracy
      name: accuracy
      value: 64.5
  - task:
      type: visual-question-answering
      name: LongVideoBench
    dataset:
      type: longvideobench
      name: LongVideoBench
    metrics:
    - type: accuracy
      name: accuracy
      value: 59.4
  - task:
      type: visual-question-answering
      name: VideoMMMU
    dataset:
      type: videommmu
      name: VideoMMMU
    metrics:
    - type: accuracy
      name: accuracy
      value: 38.6
  - task:
      type: visual-question-answering
      name: MMVP
    dataset:
      type: mmvp
      name: MMVP
    metrics:
    - type: accuracy
      name: accuracy
      value: 60
  - task:
      type: visual-question-answering
      name: 3DSR
    dataset:
      type: 3dsr
      name: 3DSR
    metrics:
    - type: accuracy
      name: accuracy
      value: 54.8
  - task:
      type: visual-question-answering
      name: CV-Bench
    dataset:
      type: cv-bench
      name: CV-Bench
    metrics:
    - type: accuracy
      name: accuracy
      value: 76.9
language:
- en
---
# Cambrian-S-7B
Website | Paper | GitHub | Cambrian-S Family
Authors: Shusheng Yang*, Jihan Yang*, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-7B is a spatially-grounded multimodal large language model that excels at spatial reasoning in video understanding. It achieves state-of-the-art results on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
## Model Details
- Architecture: Qwen2.5-7B-Instruct LLM + SigLIP2-SO400M vision encoder + 2-layer MLP adapter (see the adapter sketch below)
- Parameters: 7B
- Vision Encoder: SigLIP2-SO400M (384×384 input)
- Training: 4-stage pipeline (image alignment → image instruction tuning → video instruction tuning → spatial instruction tuning)
- Training Data: VSI-590K (spatial reasoning) + general video instruction data
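The adapter that connects the vision encoder to the LLM is a 2-layer MLP. The checkpoint's exact module is not reproduced here; the following is a minimal sketch of such a projector, where the feature widths (1152 for SigLIP2-SO400M, 3584 for Qwen2.5-7B) and the token count are illustrative assumptions rather than values read from this model.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Minimal 2-layer MLP adapter: maps vision patch features into the LLM's
    embedding space. Dimensions are illustrative assumptions, not values read
    from the released checkpoint."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_tokens)

# Example with a dummy batch of patch tokens (token count is illustrative).
tokens = torch.randn(1, 729, 1152)
print(MLPProjector()(tokens).shape)  # torch.Size([1, 729, 3584])
```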
## Usage
```python
from PIL import Image

from cambrian.constants import IMAGE_TOKEN_INDEX
from cambrian.conversation import conv_templates
from cambrian.mm_utils import process_images, tokenizer_image_token
from cambrian.model.builder import load_pretrained_model

model_path = "nyu-visionx/Cambrian-S-7B"
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "cambrian-s-7b", device_map="cuda")

# Process image/video input
image = Image.open("example.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_sizes = [image.size]

# Build the prompt from the conversation template
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat objects are in this scene?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Tokenize (inserting the image placeholder token) and generate
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
output_ids = model.generate(input_ids, images=image_tensor, image_sizes=image_sizes, max_new_tokens=256)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```
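The example above feeds a single image. For video, a common pattern, shown here only as a sketch and not as the repository's confirmed pipeline (see the Cambrian-S GitHub repo for its frame-sampling and inference scripts), is to sample frames uniformly and reuse the same preprocessing path:

```python
# Sketch: uniform frame sampling for video input. decord is one common choice;
# the exact video pipeline used by the Cambrian-S repo may differ.
import numpy as np
from decord import VideoReader
from PIL import Image

vr = VideoReader("example_video.mp4")
idx = np.linspace(0, len(vr) - 1, num=32, dtype=int)  # sample 32 frames uniformly
frames = [Image.fromarray(f) for f in vr.get_batch(idx).asnumpy()]

# Reuse the image preprocessing path; whether frames are then passed to
# model.generate as a list or a stacked tensor is an assumption here --
# check the repo's video inference script.
video_tensor = process_images(frames, image_processor, model.config)
video_sizes = [frames[0].size]
```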
## Citation
```bibtex
@article{yang2025cambrian,
  title={Cambrian-S: Towards Spatial Supersensing in Video},
  author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and others},
  journal={arXiv preprint arXiv:2025},
  year={2025}
}
```