Papers
TV2TV: A Unified Framework for Interleaved Language and Video Generation
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
Collection for Code World Model, an agentic coding model from FAIR.
A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann
-
facebook/vjepa2-vitl-fpc64-256
Video Classification • 0.3B • Updated • 104k • 162 -
facebook/vjepa2-vith-fpc64-256
Video Classification • 0.7B • Updated • 4.64k • 13 -
facebook/vjepa2-vitg-fpc64-256
Video Classification • 1B • Updated • 30.6k • 21 -
facebook/vjepa2-vitg-fpc64-384
Video Classification • 1B • Updated • 1.22k • 33
A collection of small (sub-1B) multilingual dense retrievers that generalize well across a number of tasks and languages.
Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (ICML 2024) https://arxiv.org/abs/2402.14905
Models continually pretrained using LayerSkip - https://arxiv.org/abs/2404.16710
-
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper • 2404.16710 • Published • 80 -
facebook/layerskip-llama2-7B
Text Generation • 7B • Updated • 803 • 15 -
facebook/layerskip-llama2-13B
Text Generation • 13B • Updated • 269 • 5 -
facebook/layerskip-llama2-70B
Text Generation • 69B • Updated • 318 • 5
A significant step towards removing language barriers through expressive, fast and high-quality AI translation.
-
Seamless: Multilingual Expressive and Streaming Speech Translation
Paper • 2312.05187 • Published • 14 -
facebook/seamless-m4t-v2-large
Automatic Speech Recognition • 2B • Updated • 53.3k • 927 -
Seamless M4T v2
📞 516 • Translate speech and text between languages
-
facebook/seamless-expressive
Text-to-Speech • Updated • 187
A collection for the first release of Wav2Vec 2.0, a speech encoder that learns powerful representations from unlabelled audio data.
-
facebook/wav2vec2-large-960h-lv60-self
Automatic Speech Recognition • Updated • 47.1k • 155 -
facebook/wav2vec2-large-960h
Automatic Speech Recognition • Updated • 15.4k • 32 -
facebook/wav2vec2-base-960h
Automatic Speech Recognition • 94.4M • Updated • 1.96M • 383 -
facebook/wav2vec2-base-100h
Automatic Speech Recognition • Updated • 1.8k • 7
A collection of multilingual Wav2Vec 2.0 checkpoints pre-trained on 53 languages and fine-tuned for CTC speech recognition.
-
facebook/wav2vec2-large-xlsr-53
Updated • 447k • 149 -
facebook/wav2vec2-xlsr-53-espeak-cv-ft
Automatic Speech Recognition • Updated • 219k • 41 -
facebook/wav2vec2-large-xlsr-53-dutch
Automatic Speech Recognition • Updated • 151 • 3 -
facebook/wav2vec2-large-xlsr-53-french
Automatic Speech Recognition • Updated • 4.09k • 13
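The checkpoints above are described as fine-tuned for CTC speech recognition. As a rough illustration of what CTC decoding does with per-frame predictions, here is a minimal greedy-decoding sketch in plain Python. It does not load any of the listed checkpoints; the function name `ctc_greedy_decode`, the `_` blank symbol, and the sample frame sequence are all assumptions made for this example.

```python
from itertools import groupby

# Hypothetical blank symbol for this sketch; real models define their own.
BLANK = "_"


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Greedy CTC decoding over a sequence of per-frame best labels.

    Step 1: collapse runs of identical consecutive labels into one.
    Step 2: drop the blank symbol, which only separates repeats.
    """
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)


if __name__ == "__main__":
    # The blank between the two "ll" runs is what preserves the double "l".
    frames = list("hh_e_ll_ll_oo")
    print(ctc_greedy_decode(frames))
```

In practice the per-frame labels would come from taking the arg-max over a model's output logits at each time step; this sketch starts after that step.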
A collection of "robust" Wav2Vec 2.0 checkpoints pre-trained on datasets from multiple domains.
-
facebook/wav2vec2-large-robust
Updated • 2.18k • 37 -
facebook/wav2vec2-large-robust-ft-libri-960h
Automatic Speech Recognition • 0.3B • Updated • 173k • 15 -
facebook/wav2vec2-large-robust-ft-swbd-300h
Automatic Speech Recognition • Updated • 3.62k • 20 -
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
Paper • 2104.01027 • Published • 1
A collection of checkpoints from the second VoxPopuli release.
-
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Paper • 2101.00390 • Published • 1 -
facebook/wav2vec2-base-bg-voxpopuli-v2
Automatic Speech Recognition • Updated • 12 • 2 -
facebook/wav2vec2-base-cs-voxpopuli-v2
Automatic Speech Recognition • Updated • 10 • 1 -
facebook/wav2vec2-base-da-voxpopuli-v2
Automatic Speech Recognition • Updated • 8
Text-to-speech models from fairseq S^2
A collection of stereo music generation models as part of the v2 MusicGen release.
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
OPT (Open Pretrained Transformer) is a series of open-source large causal language models that perform similarly to GPT-3.
-
facebook/metaclip-2-worldwide-huge-quickgelu
Zero-Shot Image Classification • 2B • Updated • 20.5k • 13 -
facebook/metaclip-2-worldwide-huge-378
Zero-Shot Image Classification • 2B • Updated • 460 • 6 -
facebook/metaclip-2-worldwide-giant
Zero-Shot Image Classification • 4B • Updated • 2.1k • 7 -
facebook/metaclip-2-worldwide-giant-378
Zero-Shot Image Classification • 4B • Updated • 961 • 11
MobileLLM-R1, a series of sub-billion parameter reasoning models
DINOv3: foundation models producing excellent dense features, outperforming SotA w/o fine-tuning - https://arxiv.org/abs/2508.10104
-
facebook/dinov3-vit7b16-pretrain-lvd1689m
Image Feature Extraction • 7B • Updated • 26.5k • 195 -
facebook/dinov3-vits16-pretrain-lvd1689m
Image Feature Extraction • 21.6M • Updated • 278k • 54 -
facebook/dinov3-convnext-small-pretrain-lvd1689m
Image Feature Extraction • 49.5M • Updated • 20.1k • 21 -
facebook/dinov3-vitb16-pretrain-lvd1689m
Image Feature Extraction • 85.7M • Updated • 321k • 84
Scaling CLIP data with transparent training distribution from an end-to-end pipeline.
-
facebook/metaclip-h14-fullcc2.5b
Zero-Shot Image Classification • 1.0B • Updated • 19.7k • 46 -
facebook/metaclip-l14-fullcc2.5b
Zero-Shot Image Classification • Updated • 1.43k • 7 -
facebook/metaclip-b16-fullcc2.5b
Zero-Shot Image Classification • Updated • 5.73k • 11 -
facebook/metaclip-b32-fullcc2.5b
Zero-Shot Image Classification • Updated • 411 • 9
-
facebook/webssl-dino300m-full2b-224
Image Feature Extraction • 0.3B • Updated • 6.54k • 10 -
facebook/webssl-dino1b-full2b-224
Image Feature Extraction • 1B • Updated • 1.04k • 3 -
facebook/webssl-dino2b-full2b-224
Image Feature Extraction • 2B • Updated • 286 -
facebook/webssl-dino3b-full2b-224
Image Feature Extraction • 3B • Updated • 326
A first-of-its-kind behavioral foundation model to control a virtual physics-based humanoid agent for a wide range of whole-body tasks.
Models and datasets for Sparsh: Self-supervised touch representations for vision-based tactile sensing
MelodyFlow: High Fidelity Text-Guided Music Generation and Editing via Single-Stage Flow Matching
Masked Audio Generation using a Single Non-Autoregressive Transformer
SeamlessM4T is designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly.
First release checkpoints for XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0.
A collection of open-source artefacts (datasets + checkpoints) from the first VoxPopuli release.
-
facebook/voxpopuli
Updated • 9.55k • 138 -
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Paper • 2101.00390 • Published • 1 -
facebook/wav2vec2-base-100k-voxpopuli
Automatic Speech Recognition • Updated • 66 • 4 -
facebook/wav2vec2-base-10k-voxpopuli-ft-cs
Automatic Speech Recognition • Updated • 23
A collection of checkpoints from the HuBERT release, a speech encoder that learns powerful representations from unlabelled audio data.
-
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Paper • 2106.07447 • Published • 4 -
facebook/hubert-base-ls960
Feature Extraction • Updated • 1.48M • 65 -
facebook/hubert-large-ll60k
Feature Extraction • Updated • 85.6k • 30 -
facebook/hubert-large-ls960-ft
Automatic Speech Recognition • Updated • 330k • 75
DINOv2: foundation models producing robust visual features suitable for image-level and pixel-level visual tasks - https://arxiv.org/abs/2304.07193
-
facebook/dinov2-small
Image Feature Extraction • 22.1M • Updated • 1.69M • 51 -
facebook/dinov2-base
Image Feature Extraction • 86.6M • Updated • 1.39M • 160 -
facebook/dinov2-large
Image Feature Extraction • 0.3B • Updated • 555k • 95 -
facebook/dinov2-giant
Image Feature Extraction • 1B • Updated • 124k • 54
Meta LLM Compiler is a state-of-the-art LLM that builds upon Code Llama with improved performance for code optimization and compiler reasoning.
Foundation models for human tasks. Code: https://github.com/facebookresearch/sapiens