Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation Paper • 2512.02457 • Published 5 days ago • 12
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark Paper • 2511.13853 • Published 19 days ago • 34
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation Paper • 2511.09611 • Published 24 days ago • 68
PairUni: Pairwise Training for Unified Multimodal Language Models Paper • 2510.25682 • Published Oct 29 • 13
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark Paper • 2510.26802 • Published Oct 30 • 33
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query Paper • 2506.03144 • Published Jun 3 • 7
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence Paper • 2510.20579 • Published Oct 23 • 55
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs Paper • 2510.18876 • Published Oct 21 • 36
Sa2VA Model Zoo Collection Hugging Face model zoo for Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos By ByteDance Seed CV Research • 12 items • Updated 10 days ago • 44
DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training Paper • 2510.11712 • Published Oct 13 • 30
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations Paper • 2509.09676 • Published Sep 11 • 32
Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models Paper • 2505.24164 • Published May 30
UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions Paper • 2506.13691 • Published Jun 16 • 2