lematt1991 committed
Commit 3348205 · verified · 1 Parent(s): dc6f393

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +168 -16
README.md CHANGED
@@ -1,31 +1,183 @@
-
  ---
  license: apache-2.0
  ---

- # Perception Encoder Audio-Video

- ## Model Summary

- Perception Encoder Audio-Video (PE-AV) is a family of state-of-the-art encoders for audio and video understanding trained via scaled contrastive learning, built on top of the [PE image/video encoder](https://arxiv.org/abs/2504.13181) (PE)

- The model is available in the following sizes:

- - [`pe-av-small`](https://huggingface.co/facebook/pe-av-small): 12 layers, 209M parameters
- - [`pe-av-base`](https://huggingface.co/facebook/pe-av-base): 16 layers, 396M parameters
- - [`pe-av-large`](https://huggingface.co/facebook/pe-av-large): 28L, 1.597B parameters

- For each size we additionally provide a version that samples a fixed 16-frames for the video branch for efficiency:

- - [`pe-av-small-16-frame`](https://huggingface.co/facebook/pe-av-small-16-frame): 12 layers, 209M parameters
- - [`pe-av-base-16-frame`](https://huggingface.co/facebook/pe-av-base-16-frame): 16 layers, 396M parameters
- - [`pe-av-large-16-frame`](https://huggingface.co/facebook/pe-av-large-16-frame): 28L, 1.597B parameters

- ## Usage

- Install `transformers` starting from version v4.34.0

  ```
- pip install 'transformers>=4.34.0'
- ```
  ---
  license: apache-2.0
  ---
+ # Perception Encoder Audio-Visual (PE-AV)
+
+ PE-AV is a state-of-the-art multimodal model that embeds audio, video, audio-video, and text into a joint embedding space. The model enables powerful cross-modal retrieval and understanding across audio, video, and text modalities.
+
+ ## Model Description
+
+ PE-AV is trained using contrastive learning to align audio, video, and text representations in a shared embedding space. The model can encode:
+ - **Audio only**: Extract audio embeddings from audio waveforms
+ - **Video only**: Extract visual embeddings from video frames
+ - **Audio-Video**: Extract joint audio-visual embeddings
+ - **Text**: Extract text embeddings optimized for different modality pairs
+
+ ## Model Variants
+
+ We release 6 model checkpoints with varying sizes and capabilities:
+
+ | Model | Avg Retrieval | Video Frames used |
+ |-------|---------------|-------------------|
+ | [`pe-av-small-16-frame`](https://huggingface.co/facebook/pe-av-small-16-frame) | 45.2 | 16 frames |
+ | [`pe-av-base-16-frame`](https://huggingface.co/facebook/pe-av-base-16-frame) | 47.0 | 16 frames |
+ | [`pe-av-large-16-frame`](https://huggingface.co/facebook/pe-av-large-16-frame) | 48.2 | 16 frames |
+ | [`pe-av-small`](https://huggingface.co/facebook/pe-av-small) | 48.1 | all frames |
+ | [`pe-av-base`](https://huggingface.co/facebook/pe-av-base) | 50.2 | all frames |
+ | [`pe-av-large`](https://huggingface.co/facebook/pe-av-large) | 51.6 | all frames |
+
+ The `-16-frame` variants sample exactly 16 evenly spaced frames from each video, while the other variants use all frames and support variable-length videos.
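+
+ For intuition, "evenly spaced" can be read as picking 16 frame indices spread uniformly over the clip. Below is a minimal sketch of that sampling (an illustration only, not the exact code used by the released transforms):
+
+ ```python
+ import numpy as np
+
+ def evenly_spaced_indices(num_frames: int, num_samples: int = 16) -> np.ndarray:
+     # Pick `num_samples` frame indices spread uniformly across the clip.
+     return np.linspace(0, num_frames - 1, num_samples).round().astype(int)
+
+ print(evenly_spaced_indices(300))  # 16 indices between 0 and 299
+ ```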
+
+ ## Quick Start
+
+ The model is available in both the [`transformers`](https://github.com/huggingface/transformers/tree/main) and [`perception_models`](https://github.com/facebookresearch/perception_models/tree/main) libraries.
+
+ ## `perception_models` Usage
+
+ ```python
+ import torch
+ from core.audio_visual_encoder import PEAudioVisual, PEAudioVisualTransform
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+ # Load model and transform
+ model = PEAudioVisual.from_config("pe-av-large", pretrained=True).to(device)
+ transform = PEAudioVisualTransform.from_config("pe-av-large")
+
+ video_files = ["video1.mp4", "video2.mp4"]
+ descriptions = ["description1", "description2"]
+ audio_files = ["audio1.wav", "audio2.wav"]
+
+ # Process inputs and get embeddings
+ inputs = transform(videos=video_files, text=descriptions, audio=audio_files).to(device)
+
+ with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
+     outputs = model(**inputs)
+
+ # Access the different embeddings
+ audio_embeds = outputs.audio_embeds # Audio-only embeddings
+ visual_embeds = outputs.visual_embeds # Video-only embeddings
+ audio_visual_embeds = outputs.audio_visual_embeds # Joint audio-visual embeddings
+ audio_text_embeds = outputs.audio_text_embeds # Text embeddings aligned to audio
+ visual_text_embeds = outputs.visual_text_embeds # Text embeddings aligned to video
+ audio_visual_text_embeds = outputs.audio_visual_text_embeds # Text embeddings aligned to audio-visual
+ audio_plus_text_embeds = outputs.audio_plus_text_embeds # Joint audio and text embedding
+ visual_plus_text_embeds = outputs.visual_plus_text_embeds # Joint video and text embedding
+
+ # Compute the dot product to get similarities
+ audio_visual_similarity = audio_embeds @ visual_embeds.T
+ # When computing similarity against text embeddings, use the
+ # appropriate text embedding for the other modality
+ audio_text_similarity = audio_embeds @ audio_text_embeds.T
+ video_text_similarity = visual_embeds @ visual_text_embeds.T
  ```
+
+ Note that you can omit any of the modalities and still use the same `forward` method; the corresponding embeddings in `outputs` will simply be `None`. For example:
+
+ ```python
+ inputs = transform(videos=video_files, text=descriptions).to(device)
+
+ with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
+     outputs = model(**inputs)
+
+ audio_embeds = outputs.audio_embeds # None
+ visual_embeds = outputs.visual_embeds # available
+ audio_visual_embeds = outputs.audio_visual_embeds # None
+ audio_visual_text_embeds = outputs.audio_visual_text_embeds # None
+ audio_text_embeds = outputs.audio_text_embeds # None
+ visual_text_embeds = outputs.visual_text_embeds # available
+ audio_plus_text_embeds = outputs.audio_plus_text_embeds # None
+ visual_plus_text_embeds = outputs.visual_plus_text_embeds # available
+ ```
+
+ We also provide methods for directly encoding an individual modality:
+
+ ```python
+ def encode_video_text(self, input_ids, attention_mask=None)
+ def encode_audio_text(self, input_ids, attention_mask=None)
+ def encode_audio_video_text(self, input_ids, attention_mask=None)
+ def encode_audio(self, input_values, padding_mask=None, input_features=None)
+ def encode_video(self, pixel_values_videos, padding_mask_videos=None, pe_features=None)
+ def encode_audio_video(
+     self,
+     input_values,
+     pixel_values_videos,
+     padding_mask=None,
+     padding_mask_videos=None,
+     pe_features=None,     # Optionally re-use pre-computed PE features
+     input_features=None,  # Optionally re-use pre-computed audio codec features
+ )
+ def encode_audio_plus_text(
+     self,
+     input_ids,
+     input_values,
+     attention_mask=None,
+     padding_mask=None,
+     input_features=None,  # Optionally re-use pre-computed audio codec features
+ )
+ def encode_video_plus_text(
+     self,
+     input_ids,
+     pixel_values_videos,
+     attention_mask=None,
+     padding_mask_videos=None,
+     pe_features=None,     # Optionally re-use pre-computed PE features
+ )
+ ```
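+
+ For example, here is a minimal sketch of encoding only the videos and their captions with these methods. It assumes the transform exposes its tensors under the same keyword names the encoders expect (`pixel_values_videos`, `input_ids`) and that each `encode_*` method returns the pooled embeddings directly; padding/attention masks are omitted for brevity:
+
+ ```python
+ # Video-only and text-only inputs (audio omitted)
+ inputs = transform(videos=video_files, text=descriptions).to(device)
+
+ with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
+     # Encode each modality separately with the dedicated methods
+     video_only_embeds = model.encode_video(inputs["pixel_values_videos"])
+     text_for_video_embeds = model.encode_video_text(inputs["input_ids"])
+
+ # Text-to-video similarity, as in the full forward pass
+ similarity = text_for_video_embeds @ video_only_embeds.T
+ ```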
+
+ ## `transformers` Usage
+
+ ```python
+ from transformers import PeAudioVideoModel, PeAudioVideoProcessor
+ import torch
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = PeAudioVideoModel.from_pretrained("facebook/pe-av-large")
+ processor = PeAudioVideoProcessor.from_pretrained("facebook/pe-av-large")
+
+ model = model.to(device)
+
+ video_files = ["video1.mp4", "video2.mp4"]
+ descriptions = ["description1", "description2"]
+ audio_files = ["audio1.wav", "audio2.wav"]
+
+ # Process inputs and get embeddings
+ inputs = processor(
+     videos=video_files, text=descriptions, audio=audio_files, return_tensors="pt", padding=True
+ )
+
+ with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
+     outputs = model(**inputs.to(device))
+
+ audio_embeds = outputs.audio_embeds # Audio-only embeddings
+ video_embeds = outputs.video_embeds # Video-only embeddings
+ audio_video_embeds = outputs.audio_video_embeds # Joint audio-video embeddings
+ text_audio_video_embeds = outputs.audio_video_text_embeds # Text embeddings aligned to audio-video
+ text_audio_embeds = outputs.text_audio_embeds # Text embeddings aligned to audio
+ text_video_embeds = outputs.text_video_embeds # Text embeddings aligned to video
+ audio_plus_text_embeds = outputs.audio_plus_text_embeds # Joint audio and text embedding
+ video_plus_text_embeds = outputs.video_plus_text_embeds # Joint video and text embedding
+ ```
+
+ Note that, unlike the `perception_models` interface, these arguments are not optional. To run on a subset of modalities, use the corresponding variant of the `forward` method:
+
+ ```python
+ def forward_text_audio
+ def forward_text_video
+ def forward_audio_video
+ ```
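+
+ Once the embeddings are computed, retrieval reduces to dot products. For example, ranking the candidate videos against each text description with the embeddings from the `transformers` example above (a minimal sketch):
+
+ ```python
+ # Rows index text descriptions, columns index candidate videos
+ text_to_video_similarity = text_video_embeds @ video_embeds.T
+ probs = text_to_video_similarity.softmax(dim=-1)
+ best_video_per_text = probs.argmax(dim=-1)
+ print(best_video_per_text)  # index of the best-matching video for each description
+ ```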
+
+ ## Citation
+
+ ```bibtex
+ @article{pe-av2025,
+   title={PEAV: An Audiovisual Perception Encoder via Large-Scale Multimodal Correspondence Learning},
+   author={Apoorv Vyas and Heng-Jui Chang and Cheng-Fu Yang and Po-Yao Huang and Luya Gao and Julius Richter and Sanyuan Chen and Matt Le and Piotr Dollár and Christoph Feichtenhofer and Ann Lee and Wei-Ning Hsu},
+   url={arxiv link coming soon},
+   year={2025}
+ }
+ ```
+
+ ## License
+
+ This model is released under the Apache 2.0 license.