---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: transformers
pipeline_tag: image-to-text
tags:
  - multimodal
  - video-understanding
  - spatial-reasoning
  - vision-language
datasets:
  - nyu-visionx/VSI-590K
model-index:
  - name: Cambrian-S-7B
    results:
      - task:
          type: visual-question-answering
          name: VSI-Bench
        dataset:
          type: vsi-bench
          name: VSI-Bench
        metrics:
          - type: accuracy
            name: accuracy
            value: 67.5
      - task:
          type: visual-question-answering
          name: Tomato
        dataset:
          type: tomato
          name: Tomato
        metrics:
          - type: accuracy
            name: accuracy
            value: 27
      - task:
          type: visual-question-answering
          name: HourVideo
        dataset:
          type: hourvideo
          name: HourVideo
        metrics:
          - type: accuracy
            name: accuracy
            value: 36.5
      - task:
          type: visual-question-answering
          name: EgoSchema
        dataset:
          type: egoschema
          name: EgoSchema
        metrics:
          - type: accuracy
            name: accuracy
            value: 76.8
      - task:
          type: visual-question-answering
          name: Perception Test
        dataset:
          type: perception-test
          name: Perception Test
        metrics:
          - type: accuracy
            name: accuracy
            value: 69.9
      - task:
          type: visual-question-answering
          name: VideoMME
        dataset:
          type: videomme
          name: VideoMME
        metrics:
          - type: accuracy
            name: accuracy
            value: 63.4
      - task:
          type: visual-question-answering
          name: MVBench
        dataset:
          type: mvbench
          name: MVBench
        metrics:
          - type: accuracy
            name: accuracy
            value: 64.5
      - task:
          type: visual-question-answering
          name: LongVideoBench
        dataset:
          type: longvideobench
          name: LongVideoBench
        metrics:
          - type: accuracy
            name: accuracy
            value: 59.4
      - task:
          type: visual-question-answering
          name: VideoMMMU
        dataset:
          type: videommmu
          name: VideoMMMU
        metrics:
          - type: accuracy
            name: accuracy
            value: 38.6
      - task:
          type: visual-question-answering
          name: MMVP
        dataset:
          type: mmvp
          name: MMVP
        metrics:
          - type: accuracy
            name: accuracy
            value: 60
      - task:
          type: visual-question-answering
          name: 3DSR
        dataset:
          type: 3dsr
          name: 3DSR
        metrics:
          - type: accuracy
            name: accuracy
            value: 54.8
      - task:
          type: visual-question-answering
          name: CV-Bench
        dataset:
          type: cv-bench
          name: CV-Bench
        metrics:
          - type: accuracy
            name: accuracy
            value: 76.9
language:
  - en
---

# Cambrian-S-7B

Website | Paper | GitHub | Cambrian-S Family

Authors: Shusheng Yang*, Jihan Yang*, Pinzhi Huang†, Ellis Brown†, et al.

Cambrian-S-7B is a spatially grounded multimodal large language model that excels at spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while maintaining competitive performance on general video understanding tasks.

## Model Details

- **Architecture:** Qwen2.5-7B-Instruct LLM + SigLIP2-SO400M vision encoder + 2-layer MLP adapter (see the sketch below)
- **Parameters:** 7B
- **Vision Encoder:** SigLIP2-SO400M (384×384 input)
- **Training:** 4-stage pipeline (image alignment → image instruction tuning → video instruction tuning → spatial instruction tuning)
- **Training Data:** VSI-590K (spatial reasoning) + general video instruction data
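
The adapter is the piece that maps vision-encoder features into the LLM's embedding space. As a rough illustration, here is a minimal sketch of a 2-layer MLP adapter; the class name, GELU activation, and dimensions (1152 for SigLIP2-SO400M features, 3584 for Qwen2.5-7B hidden states) are assumptions for illustration, not the repo's actual implementation.

```python
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Hypothetical sketch of a 2-layer MLP vision-language projector."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):  # assumed dims
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),  # activation choice is an assumption
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):
        # vision_features: (batch, num_patches, vision_dim)
        # returns projected tokens of shape (batch, num_patches, llm_dim),
        # which are spliced into the LLM input sequence at the <image> slot
        return self.proj(vision_features)
```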

## Usage

```python
from PIL import Image

from cambrian.model.builder import load_pretrained_model
from cambrian.mm_utils import process_images, tokenizer_image_token
from cambrian.conversation import conv_templates
from cambrian.constants import IMAGE_TOKEN_INDEX

model_path = "nyu-visionx/Cambrian-S-7B"
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "cambrian-s-7b", device_map="cuda")

# Build the prompt; the <image> placeholder marks where visual tokens are inserted
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat objects are in this scene?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Preprocess the image and tokenize the prompt
image = Image.open("example.jpg").convert("RGB")  # replace with your image
image_tensor = process_images([image], image_processor, model.config)
image_sizes = [image.size]
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

# Generate
output_ids = model.generate(input_ids, images=image_tensor, image_sizes=image_sizes)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
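
For video inputs, a common pattern in LLaVA-style codebases is to sample frames uniformly and pass them in as a stack of images. The helper below is a hypothetical sketch assuming the `decord` package is available; the Cambrian-S repo may ship its own video loader with a different interface.

```python
import numpy as np
from PIL import Image
from decord import VideoReader

def sample_frames(video_path: str, num_frames: int = 32):
    """Uniformly sample RGB frames from a video (hypothetical helper)."""
    vr = VideoReader(video_path)
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return [Image.fromarray(vr[i].asnumpy()) for i in indices]

# Reuse image_processor / model from the example above
frames = sample_frames("example_video.mp4")
video_tensor = process_images(frames, image_processor, model.config)
# Then build the prompt with an <image> placeholder and call model.generate
# with images=video_tensor, as in the single-image example.
```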

## Citation

```bibtex
@article{yang2025cambrian,
  title={Cambrian-S: Towards Spatial Supersensing in Video},
  author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and others},
  journal={arXiv preprint arXiv:2025},
  year={2025}
}
```