lmz committed on
Commit 5ccfc8f · verified · 1 Parent(s): 0dc5960

Create README.md

Files changed (1)
  1. README.md +78 -0
README.md ADDED
@@ -0,0 +1,78 @@
+ ---
+ license: cc-by-4.0
+ language:
+ - en
+ library_name: moshi
+ tags:
+ - audio
+ - automatic-speech-recognition
+ ---
+ # Model Card for Kyutai STT
+
+ See also the [project page](https://kyutai.org/next/stt)
+ and the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).
+
+ This is a model for streaming speech-to-text (STT, also known as automatic speech recognition, ASR).
+ Unlike offline speech-to-text, where the model needs the entire audio to produce the transcript,
+ our model starts to output the transcript as soon as a few seconds of audio become available.
+
+ ## Model Details
+
+ The model architecture is a Transformer that consumes audio tokenized by Mimi (see [the Moshi paper](https://arxiv.org/abs/2410.00037)) and outputs text tokens.
+ The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens.
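+
+ As a back-of-the-envelope sketch based only on the numbers above (12.5 frames per second, 32 tokens per frame, so 400 audio tokens per second), the helpers below are hypothetical and not part of the released code:
+
+ ```python
+ # Hypothetical helpers: convert an audio duration into frame and token counts
+ # using the rates stated in this model card.
+ FRAME_RATE_HZ = 12.5   # audio frames per second
+ TOKENS_PER_FRAME = 32  # audio tokens per frame
+
+ def frames_for(seconds: float) -> int:
+     """Number of audio frames covering `seconds` of audio."""
+     return int(seconds * FRAME_RATE_HZ)
+
+ def audio_tokens_for(seconds: float) -> int:
+     """Number of audio tokens covering `seconds` of audio."""
+     return frames_for(seconds) * TOKENS_PER_FRAME
+
+ print(frames_for(10.0))        # 125 frames for 10 seconds of audio
+ print(audio_tokens_for(10.0))  # 4000 audio tokens
+ ```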
+
+ We release two models:
+ - `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
+ - `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.
+
+ ## Model Description
+
+ Kyutai STT is a decoder-only model for streaming speech-to-text.
+ It leverages the multistream architecture of [Moshi](https://moshi.chat/) to model the text stream based on the speech stream.
+ The text stream is shifted with respect to the audio stream, which lets the model predict each text token from the audio it has already received.
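+
+ To make that shift concrete, here is a minimal illustrative sketch, not the released training or inference code; the per-frame list representation and pad token are assumptions:
+
+ ```python
+ # Illustrative only: right-shift a per-frame text stream by the model's delay
+ # so that each text token lines up with audio the model has already seen.
+ FRAME_RATE_HZ = 12.5
+
+ def shift_text_stream(text_tokens, delay_seconds, pad_token="<pad>"):
+     """Prepend pad tokens so the text stream lags the audio stream."""
+     delay_frames = round(delay_seconds * FRAME_RATE_HZ)
+     return [pad_token] * delay_frames + list(text_tokens)
+
+ # With the 1B model's 0.5 second delay, the text stream lags the audio by 6 frames.
+ shifted = shift_text_stream(["hello", "<pad>", "world"], delay_seconds=0.5)
+ print(shifted)
+ # ['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'hello', '<pad>', 'world']
+ ```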
+
+ * Developed by: Kyutai
+ * Model type: Streaming Speech-to-Text transcription.
+ * Language(s) (NLP): English and French for `kyutai/stt-1b-en_fr`, English for `kyutai/stt-2.6b-en`
+ * License: Model weights are licensed under CC-BY 4.0
+ * Repository: [GitHub](https://github.com/kyutai-labs/delayed-streams-modeling/)
+
+ ## Uses
+
+ ### Direct Use
+
+ The model can be used for streaming speech-to-text.
+ It is robust to noisy conditions and was found to perform well on audio as long as 2 hours with no additional changes.
+ The model produces transcripts with capitalization and punctuation.
+ The predicted text token timestamps can be recovered by subtracting the model's text stream offset (0.5 or 2.5 seconds) from the frame's offset.
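+
+ A minimal sketch of that subtraction, assuming only the 12.5 Hz frame rate and the delays stated in this card (function and variable names are hypothetical):
+
+ ```python
+ # Recover an approximate audio timestamp for a text token emitted at a given frame.
+ FRAME_RATE_HZ = 12.5
+ TEXT_STREAM_DELAY_S = {"kyutai/stt-1b-en_fr": 0.5, "kyutai/stt-2.6b-en": 2.5}
+
+ def token_timestamp(frame_index: int, model: str) -> float:
+     """Frame offset in seconds minus the model's text-stream delay."""
+     frame_offset_s = frame_index / FRAME_RATE_HZ
+     return frame_offset_s - TEXT_STREAM_DELAY_S[model]
+
+ # A token emitted at frame 50 by the 2.6B model maps back to 1.5 s into the audio.
+ print(token_timestamp(50, "kyutai/stt-2.6b-en"))  # 1.5
+ ```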
+
+ ## How to Get Started with the Model
+
+ See the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).
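+
+ The repository contains the streaming inference entry points. As a starting point that only downloads the checkpoint files locally (it does not run inference; `snapshot_download` is a standard Hugging Face Hub call, and the repo id shown is just this card's 2.6B model):
+
+ ```python
+ # Fetch the model weights from the Hugging Face Hub; see the GitHub repository
+ # above for the PyTorch / Rust / MLX streaming inference scripts.
+ from huggingface_hub import snapshot_download
+
+ local_dir = snapshot_download("kyutai/stt-2.6b-en")
+ print(local_dir)  # path to the downloaded checkpoint files
+ ```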
+
+ ## Training Details
+
+ ### Training Data
+
+ Pretraining stage: For both `kyutai/stt-2.6b-en` and `kyutai/stt-1b-en_fr`, we use an audio collection of 2.5 million hours of publicly available audio content.
+ For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped).
+
+ For `kyutai/stt-2.6b-en`:
+
+ - Finetuning stage: We then finetune the model on a collection of public datasets with
+ ground-truth transcripts. This dataset contains 24,000 hours of audio.
+
+ - Long-form finetuning stage: Finally, we finetune the model on a combination of data from the previous stage and long-form audio.
+ The long-form audio is obtained from two sources: (a) concatenating LibriSpeech examples (1,000 hours), (b) synthesizing dialogs (22,000 hours).
+
+ For `kyutai/stt-1b-en_fr`:
+
+ - Finetuning stage: We finetune on the Fisher dataset of 2,000 hours of English audio, plus proprietary data (1,000 hours in English, 600 hours in French).
+
+ ### Compute Infrastructure
+
+ Pretraining and finetuning were done with 48 and 16 Nvidia H100 GPUs, respectively.
+
+ ## Model Card Authors
+
+ Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez