---
title: VAD Demo - Real-time Speech Detection
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.42.0
app_file: app.py
pinned: false
license: mit
---

# 🎤 VAD Demo: Real-time Speech Detection Framework

[![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/gbibbo/vad_demo)
[![WASPAA 2025](https://img.shields.io/badge/WASPAA-2025-green)](https://waspaa.com)

> **Real-time multi-model voice activity detection with interactive visualization, optimized for CPU and free Hugging Face Spaces**

This demo showcases a comprehensive **speech removal framework** designed for privacy-preserving audio recordings, featuring **3 state-of-the-art AI models** with **real-time processing** and **interactive visualization**.

## 🎯 **Live Demo Features**

### 🤖 **Multi-Model Support**

Compare 3 different AI models side by side:

| Model | Parameters | Speed | Accuracy | Best For |
|-------|------------|-------|----------|----------|
| **Silero-VAD** | 1.8M | ⚡⚡⚡ | ⭐⭐⭐⭐ | General purpose |
| **WebRTC-VAD** | <0.1M | ⚡⚡⚡⚡ | ⭐⭐⭐ | Ultra-fast processing |
| **E-PANNs** | 22M | ⚡⚡ | ⭐⭐⭐⭐ | Efficient AI (73% parameter reduction) |

### 📊 **Real-time Visualization**

- **Dual Analysis**: Compare two models simultaneously
- **Waveform Display**: Live audio visualization
- **Probability Charts**: Real-time speech detection confidence
- **Performance Metrics**: Processing time comparison across models

### 🔒 **Privacy-Preserving Applications**

- **Smart Home Audio**: Remove personal conversations while preserving environmental sounds
- **GDPR Compliance**: Privacy-aware audio dataset processing
- **Real-time Processing**: Continuous 4-second chunk analysis at 16kHz
- **CPU Optimized**: Runs efficiently on standard hardware

## 🚀 **Quick Start**

### Option 1: Use the Live Demo (Recommended)

Click the Hugging Face Spaces badge above to try the demo instantly!

### Option 2: Run Locally

```bash
git clone https://huggingface.co/spaces/gbibbo/vad_demo
cd vad_demo
pip install -r requirements.txt
python app.py
```

## 🎛️ **How to Use**

1. **🎤 Record Audio**: Click the microphone and record 2-4 seconds of speech
2. **🔧 Select Models**: Choose different models for Model A and Model B comparison
3. **⚙️ Adjust Threshold**: Lower = more sensitive detection (0.0-1.0)
4. **🎯 Process**: Click "Process Audio" to analyze
5. **📊 View Results**: Observe probability charts and detailed analysis
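For reference, the detection step behind this workflow can also be run outside the UI. The snippet below is a minimal sketch using the official Silero-VAD `torch.hub` entry point (it assumes `torch` and `torchaudio` are installed; the file name and the 0.5 threshold are placeholders, not the demo's exact `app.py` code):

```python
import torch

# Load the official Silero-VAD model and its helper utilities from torch.hub
# (the model is downloaded on first use).
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

SAMPLE_RATE = 16000  # the demo analyses audio at 16 kHz

# "recording.wav" is a placeholder for a short microphone recording.
wav = read_audio("recording.wav", sampling_rate=SAMPLE_RATE)

# The threshold plays the same role as the demo's sensitivity slider:
# lower values flag more of the audio as speech.
segments = get_speech_timestamps(
    wav, model,
    sampling_rate=SAMPLE_RATE,
    threshold=0.5,
    return_seconds=True,
)
for seg in segments:
    print(f"Speech from {seg['start']:.2f}s to {seg['end']:.2f}s")
```

Lowering `threshold` has the same effect as lowering the demo's sensitivity slider: more of the audio is flagged as speech.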
## 🏗️ **Technical Architecture**

### **CPU Optimization Strategies**

- **Lazy Loading**: Models load only when needed
- **Efficient Processing**: Optimized audio chunk processing
- **Memory Management**: Smart buffer management for continuous operation
- **Fallback Systems**: Graceful degradation when models are unavailable

### **Audio Processing Pipeline**

```
Audio Input (Microphone)
        ↓
Preprocessing (Normalization, Resampling)
        ↓
Feature Extraction (Spectrograms, MFCCs)
        ↓
Multi-Model Inference (Parallel Processing)
        ↓
Visualization (Interactive Plotly Dashboard)
```

### **Model Implementation Details**

#### **Silero-VAD** (Production Ready)
- **Source**: `torch.hub` official Silero model
- **Optimization**: Direct PyTorch inference
- **Memory**: ~50MB RAM usage
- **Latency**: ~30ms processing time

#### **WebRTC-VAD** (Ultra-Fast)
- **Source**: Google WebRTC project
- **Fallback**: Energy-based VAD when WebRTC is unavailable
- **Latency**: <5ms processing time
- **Memory**: ~10MB RAM usage

#### **E-PANNs** (Efficient Deep Learning)
- **Features**: Mel-spectrogram + MFCC analysis
- **Optimization**: Simplified neural architecture
- **Speed**: 2-3x faster than full PANNs
- **Memory**: ~150MB RAM usage

## 📈 **Performance Benchmarks**

Evaluated on the **CHiME-Home dataset** (adapted for CPU):

| Model | F1-Score | RTF (CPU) | Memory | Use Case |
|-------|----------|-----------|--------|----------|
| Silero-VAD | 0.806 | 0.065 | 50MB | Lightweight |
| WebRTC-VAD | 0.708 | 0.003 | 10MB | Ultra-fast |
| E-PANNs | 0.847 | 0.180 | 150MB | Balanced |

*RTF: Real-Time Factor (lower is better, <1.0 = real-time capable)*

## 🔬 **Research Applications**

### **Privacy-Preserving Audio Processing**
- **Domestic Recordings**: Remove personal conversations
- **Smart Speakers**: Privacy-aware voice assistants
- **Audio Datasets**: GDPR-compliant data collection
- **Surveillance Systems**: Selective audio monitoring

### **Speech Technology Research**
- **Model Comparison**: Benchmark different VAD approaches
- **Real-time Systems**: Low-latency speech detection
- **Edge Computing**: CPU-efficient processing
- **Hybrid Systems**: Combine multiple detection methods

## 📊 **Technical Specifications**

### **System Requirements**
- **CPU**: 2+ cores (4+ recommended)
- **RAM**: 1GB minimum (2GB recommended)
- **Python**: 3.8+ (3.10+ recommended)
- **Browser**: Chrome/Firefox with microphone support

### **Hugging Face Spaces Optimization**
- **Memory Limit**: Designed for the 16GB Spaces limit
- **CPU Cores**: Optimized for 8-core allocation
- **Storage**: <500MB model storage requirement
- **Networking**: Minimal external dependencies

### **Audio Specifications**
- **Input Format**: 16-bit PCM, mono/stereo
- **Sample Rates**: 8kHz, 16kHz, 32kHz, 48kHz (auto-conversion)
- **Chunk Size**: 4-second processing windows
- **Latency**: <200ms processing delay

## 📚 **Research Citation**

If you use this demo in your research, please cite:

```bibtex
@inproceedings{bibbo2025speech,
  title={Speech Removal Framework for Privacy-Preserving Audio Recordings},
  author={[Authors omitted for review]},
  booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year={2025},
  organization={IEEE}
}
```
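As a concrete illustration of the audio specifications listed above (not the demo's exact `app.py` implementation), the sketch below converts arbitrary microphone input to mono 16 kHz float audio and splits it into the 4-second windows the models consume; `scipy` is assumed for resampling and all names are illustrative.

```python
from typing import List

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16000   # processing sample rate used by the demo
CHUNK_SECONDS = 4   # 4-second analysis windows


def preprocess(audio: np.ndarray, sr: int) -> List[np.ndarray]:
    """Turn raw input audio (any supported rate, mono or stereo) into
    normalized 16 kHz chunks of exactly 4 seconds each."""
    # Down-mix stereo (samples, channels) to mono.
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    audio = audio.astype(np.float32)

    # Scale 16-bit PCM integer input into the [-1, 1] range.
    if np.abs(audio).max() > 1.0:
        audio = audio / 32768.0

    # Resample to 16 kHz if the input uses another supported rate.
    if sr != TARGET_SR:
        audio = resample_poly(audio, TARGET_SR, sr).astype(np.float32)

    # Split into fixed-length windows, zero-padding the final chunk.
    chunk_len = TARGET_SR * CHUNK_SECONDS
    n_chunks = max(1, int(np.ceil(len(audio) / chunk_len)))
    padded = np.zeros(n_chunks * chunk_len, dtype=np.float32)
    padded[: len(audio)] = audio
    return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]
```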
## 🤝 **Contributing**

We welcome contributions! Areas for improvement:

- **New Models**: Add state-of-the-art VAD models
- **Optimization**: Further CPU/memory optimizations
- **Features**: Additional visualization and analysis tools
- **Documentation**: Improve tutorials and examples

## 📞 **Support**

- **Issues**: [GitHub Issues](https://github.com/gbibbo/vad_demo/issues)
- **Discussions**: [Hugging Face Discussions](https://huggingface.co/spaces/gbibbo/vad_demo/discussions)
- **WASPAA 2025**: Visit our paper presentation

## 📄 **License**

This project is licensed under the **MIT License**.

## 🙏 **Acknowledgments**

- **Silero-VAD**: Silero Team
- **WebRTC-VAD**: Google WebRTC Project
- **E-PANNs**: Efficient PANNs Implementation
- **Hugging Face**: Free Spaces hosting
- **Funding**: AI4S, University of Surrey, EPSRC, CVSSP

---

**🎯 Ready for WASPAA 2025 Demo** | **⚡ CPU Optimized** | **🆓 Free to Use** | **🤗 Hugging Face Spaces**