# Sherpa ONNX STT Models - INT8 Quantized Collection
A comprehensive collection of INT8 quantized speech-to-text models optimized for edge devices and production environments. All models are quantized using dynamic quantization to reduce size by ~50% while maintaining accuracy.
## Model Overview
This collection includes 17 INT8 quantized models covering 7 languages:
| Language | Models | Architecture | Use Case |
|---|---|---|---|
| 🇬🇧 English | 5 models | Kroko + NeMo | Gaming, Reading, General |
| 🇩🇪 German | 2 models | Kroko | General Purpose |
| 🇪🇸 Spanish | 2 models | Kroko | General Purpose |
| 🇫🇷 French | 2 models | Kroko | General Purpose |
| 🇹🇷 Turkish | 2 models | Kroko | General Purpose |
| 🇮🇹 Italian | 2 models | Kroko | General Purpose |
| 🇵🇹 Portuguese | 2 models | Kroko | General Purpose |
Total Size: 2.38 GB (all INT8 quantized)
## Model Details

### Kroko Models (Community)

Kroko models are high-quality streaming ASR models based on the Zipformer2 architecture with a transducer decoder.
#### German (DE)

- `kroko_64l`: 147 MB (64-layer encoder)
- `kroko_128l`: 147 MB (128-layer encoder)

#### English (EN)

- `kroko_64l`: 147 MB (64-layer encoder)
- `kroko_128l`: 147 MB (128-layer encoder)

#### Spanish (ES)

- `kroko_64l`: 147 MB (64-layer encoder)
- `kroko_128l`: 147 MB (128-layer encoder)

#### French (FR)

- `kroko_64l`: 147 MB (64-layer encoder)
- `kroko_128l`: 147 MB (128-layer encoder)

#### Turkish (TR)

- `kroko_64l`: 147 MB (64-layer encoder)
- `kroko_128l`: 147 MB (128-layer encoder)

#### Italian (IT)

- `kroko_64l`: 147 MB (64-layer encoder)
- `kroko_128l`: 147 MB (128-layer encoder)

#### Portuguese (PT)

- `kroko_64l`: 147 MB (64-layer encoder)
- `kroko_128l`: 147 MB (128-layer encoder)
### NeMo CTC Models (English)

Ultra-fast CTC-based models optimized for real-time applications:

- `nemo_ctc_80ms`: 126 MB - ultra-fast (80 ms latency), for gaming
- `nemo_ctc_480ms`: 126 MB - balanced (480 ms latency), for reading
- `nemo_ctc_1040ms`: 126 MB - high accuracy (1040 ms latency)
## Quick Start

### Installation

```bash
pip install sherpa-onnx
```
### Usage Example (Python)

```python
import sherpa_onnx

# Initialize a streaming recognizer with the English Kroko model
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="models/en/kroko_64l/tokens.txt",
    encoder="models/en/kroko_64l/encoder.int8.onnx",
    decoder="models/en/kroko_64l/decoder.int8.onnx",
    joiner="models/en/kroko_64l/joiner.int8.onnx",
    num_threads=4,
)

# Create a stream and feed it audio
stream = recognizer.create_stream()
# stream.accept_waveform(sample_rate, samples)  # float32 samples in [-1, 1]
# while recognizer.is_ready(stream):
#     recognizer.decode_stream(stream)
# result = recognizer.get_result(stream)
```
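Streams consume mono float32 samples scaled to [-1, 1]. A minimal WAV-loading sketch using the standard library plus NumPy (mono 16-bit input is an assumption; adapt for other formats):

```python
import wave

import numpy as np


def pcm16_to_float32(data: bytes) -> np.ndarray:
    """Convert little-endian 16-bit PCM bytes to float32 in [-1, 1]."""
    return np.frombuffer(data, dtype="<i2").astype(np.float32) / 32768.0


def load_wav(path: str) -> tuple[np.ndarray, int]:
    """Read a mono 16-bit WAV file; return (samples, sample_rate)."""
    with wave.open(path, "rb") as f:
        assert f.getnchannels() == 1 and f.getsampwidth() == 2
        samples = pcm16_to_float32(f.readframes(f.getnframes()))
        return samples, f.getframerate()
```

The returned samples can then be passed to `stream.accept_waveform(sample_rate, samples)` before decoding.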
### Usage Example (NeMo CTC)

```python
import sherpa_onnx

# Initialize a streaming recognizer with the NeMo CTC model
recognizer = sherpa_onnx.OnlineRecognizer.from_nemo_ctc(
    tokens="models/en/nemo_ctc_80ms/tokens.txt",
    model="models/en/nemo_ctc_80ms/model.int8.onnx",
    num_threads=4,
)
```
## Model Architecture

### Kroko (Transducer)

- Encoder: Zipformer2 with 64 or 128 layers
- Decoder: RNN-T decoder (stateful)
- Joiner: simple feedforward network
- Format: ONNX, INT8 quantized
- Components: 3 files (`encoder.int8.onnx`, `decoder.int8.onnx`, `joiner.int8.onnx`)

### NeMo (CTC)

- Architecture: Fast Conformer with CTC
- Format: ONNX, INT8 quantized
- Components: 1 file (`model.int8.onnx`)
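The two component layouts can be validated before loading a model. A small helper sketch (file names follow the layouts described above; the function names are illustrative):

```python
import os

# Files each architecture needs, per the component lists above
REQUIRED_FILES = {
    "transducer": ["encoder.int8.onnx", "decoder.int8.onnx",
                   "joiner.int8.onnx", "tokens.txt"],
    "ctc": ["model.int8.onnx", "tokens.txt"],
}


def missing_files(present: set[str], kind: str) -> list[str]:
    """Return the required files for `kind` that are absent from `present`."""
    return [f for f in REQUIRED_FILES[kind] if f not in present]


def check_model_dir(model_dir: str, kind: str) -> list[str]:
    """Check an on-disk model directory for missing component files."""
    return missing_files(set(os.listdir(model_dir)), kind)
```

For example, `check_model_dir("models/en/kroko_64l", "transducer")` returns an empty list when the directory is complete.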
## Recommended Use Cases

### Gaming Applications (Word Sniper, Word Wave)

- Best choice: `nemo_ctc_80ms` - ultra-low latency (80 ms)
- Alternative: `kroko_64l` - better accuracy with acceptable latency

### Reading Exercises (Echo Challenge)

- Best choice: `nemo_ctc_480ms` - balanced latency and accuracy
- Alternative: `kroko_64l` - higher accuracy for complex sentences

### General Purpose STT

- Best choice: `kroko_128l` - highest accuracy
- Alternative: `kroko_64l` - faster inference, good accuracy

### Low-end Devices (512 MB-1 GB RAM)

- Best choice: `kroko_64l` - smaller encoder, lower memory usage
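The recommendations above can be captured in a small lookup helper. A sketch (the use-case labels are this document's own, not part of any sherpa-onnx API):

```python
# Maps a use case to (best choice, alternative), per the recommendations above
RECOMMENDATIONS = {
    "gaming": ("nemo_ctc_80ms", "kroko_64l"),
    "reading": ("nemo_ctc_480ms", "kroko_64l"),
    "general": ("kroko_128l", "kroko_64l"),
    "low_memory": ("kroko_64l", None),
}


def pick_model(use_case: str, prefer_alternative: bool = False) -> str:
    """Return the recommended model name for a use case."""
    best, alt = RECOMMENDATIONS[use_case]
    return alt if prefer_alternative and alt else best
```

The returned name can be joined with a language directory (e.g. `models/en/`) to locate the model files.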
## Quantization Details

All models are quantized using ONNX Runtime dynamic quantization:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="encoder.onnx",
    model_output="encoder.int8.onnx",
    weight_type=QuantType.QUInt8,
)
```
Benefits:

- ~50% size reduction compared to the FP32 originals
- Faster inference on CPU
- Lower memory usage
- Minimal accuracy loss (<2% WER increase)
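To quantize every component of a transducer model in one pass, the `quantize_dynamic` call can be wrapped in a loop. A sketch (directory layout as in this repo; the `onnxruntime` import is kept inside the function so the path helper stays usable without it installed):

```python
import os


def int8_path(path: str) -> str:
    """Derive the INT8 output name, e.g. encoder.onnx -> encoder.int8.onnx."""
    root, ext = os.path.splitext(path)
    return f"{root}.int8{ext}"


def quantize_transducer_dir(model_dir: str) -> None:
    """Dynamically quantize encoder/decoder/joiner weights to UINT8."""
    from onnxruntime.quantization import QuantType, quantize_dynamic

    for name in ("encoder.onnx", "decoder.onnx", "joiner.onnx"):
        src = os.path.join(model_dir, name)
        quantize_dynamic(
            model_input=src,
            model_output=int8_path(src),
            weight_type=QuantType.QUInt8,
        )
```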
## Directory Structure

```text
models/
├── de/
│   ├── kroko_64l/
│   │   ├── encoder.int8.onnx
│   │   ├── decoder.int8.onnx
│   │   ├── joiner.int8.onnx
│   │   └── tokens.txt
│   └── kroko_128l/
│       └── ...
├── en/
│   ├── kroko_64l/
│   ├── kroko_128l/
│   ├── nemo_ctc_80ms/
│   │   ├── model.int8.onnx
│   │   └── tokens.txt
│   ├── nemo_ctc_480ms/
│   └── nemo_ctc_1040ms/
├── es/
│   ├── kroko_64l/
│   └── kroko_128l/
├── fr/
│   ├── kroko_64l/
│   └── kroko_128l/
├── tr/
│   ├── kroko_64l/
│   └── kroko_128l/
├── it/
│   ├── kroko_64l/
│   └── kroko_128l/
└── pt/
    ├── kroko_64l/
    └── kroko_128l/
```
## Credits & Acknowledgments

### Kroko Models

These models are derived from the Banafo Kroko ASR project, an open-source multilingual speech recognition initiative.

- Original source: Banafo/Kroko-ASR
- Community models: all Kroko models (DE, EN, ES, FR, TR, IT, PT) are Community versions
- Architecture: Zipformer2 + transducer
- Training: based on the Next-gen Kaldi framework
- License: Apache 2.0

Special thanks to the Banafo team for providing high-quality multilingual ASR models with streaming capabilities.

#### Kroko Model Variants

- 64L: 64-layer encoder, optimized for speed
- 128L: 128-layer encoder, optimized for accuracy

### NeMo Models

- Source: NVIDIA NeMo Toolkit
- Architecture: Fast Conformer CTC
- Training framework: NeMo ASR

### Quantization

- Tool: ONNX Runtime
- Method: dynamic quantization (QUInt8)
- Performed by: this repository's maintainer
## License

All models in this collection are released under the Apache 2.0 License.

### Original Model Licenses

- Kroko models: Apache 2.0 (from Banafo/Kroko-ASR)
- NeMo models: Apache 2.0 (from NVIDIA NeMo)
## Performance Benchmarks

| Model | Size | Latency | WER (en) | Memory | Best For |
|---|---|---|---|---|---|
| `nemo_ctc_80ms` | 126 MB | 80 ms | ~8% | 512 MB | Gaming |
| `nemo_ctc_480ms` | 126 MB | 480 ms | ~6% | 512 MB | Reading |
| `kroko_64l` | 147 MB | ~200 ms | ~5% | 1 GB | General |
| `kroko_128l` | 147 MB | ~300 ms | ~4% | 1.5 GB | High accuracy |

Benchmarks are approximate and may vary with hardware and audio conditions.
## System Requirements

- Minimum RAM: 512 MB (for NeMo models)
- Recommended RAM: 1-2 GB (for Kroko models)
- CPU: any modern CPU with AVX2 support
- OS: Windows, Linux, macOS, Android (7.0+), iOS
- Runtime: ONNX Runtime (CPU)
## Known Limitations

- INT8 quantization may cause slight accuracy degradation (~1-2% WER increase)
- Kroko 128L models require more memory than the 64L variants
- The NeMo models support English only
- Real-time performance depends on CPU capabilities
## Citation

If you use these models in your research or application, please cite:

```bibtex
@misc{sherpa-onnx-int8-models,
  title={Sherpa ONNX STT Models - INT8 Quantized Collection},
  author={Your Name/Organization},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/your-username/sherpa-onnx-int8-models}},
  note={Quantized from Banafo/Kroko-ASR and NVIDIA NeMo models}
}
```

Original Kroko citation:

```bibtex
@misc{banafo-kroko-asr,
  title={Kroko ASR: Multilingual Streaming Speech Recognition},
  author={Banafo Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/Banafo/Kroko-ASR}}
}
```
## Support

For issues and questions:

- Sherpa-ONNX: GitHub Issues
- Kroko models: Banafo Kroko-ASR
## Version History

- v1.0.0 (2025-11-07): initial release
  - 17 INT8 quantized models
  - 7 languages supported (DE, EN, ES, FR, TR, IT, PT)
  - Total size: 2.38 GB
Made with ❤️ using Sherpa-ONNX and ONNX Runtime