# Feature Specification: Audio Streaming System

**ID:** AUDIO-001  
**Version:** 1.0  
**Status:** Planned  
**Priority:** Critical

## Overview

The audio streaming system handles real-time voice packet capture, encoding, transmission, and playback between clients and the server.

## Architecture

### Client-Side Audio Pipeline

```
Microphone Input → Audio Capture → Opus Encoder → Packet Formation → Network Transmission
                                                                             ↓
                 Speaker Output ← Audio Decoder ← Packet Reception ← Network Reception
```

### Server-Side Audio Pipeline

```
Client 1 Voice Packets → Voice Packet Router → Broadcast to Channel Members
Client 2 Voice Packets → Voice Packet Router → [Client 1, Client 3, ...]
Client 3 Voice Packets → Voice Packet Router
```

## Requirements

### Audio Capture (Client)

- **Sample Rate:** 48 kHz (Opus native fullband rate)
- **Bit Depth:** 16-bit PCM
- **Frame Size:** 20ms frames (960 samples per channel at 48 kHz)
- **Channels:** Mono or stereo (initially mono)
- **VAD (Voice Activity Detection):** Optional; reduces bandwidth during silence
- Support multiple audio devices (fall back to the default device)

### Audio Encoding (Client)

- **Codec:** Opus with variable bitrate
- **Bitrate Range:** 8-128 kbps (configurable)
- **Default Bitrate:** 64 kbps
- **Latency:** <20ms encoding latency
- Frame-based encoding (process 20ms chunks)

### Packet Format

```
                [Header]                 [Payload]
                    ↓                        ↓
[SeqNum][Timestamp][SSRC][Payload Length][Opus Data]
  (2B)      (4B)    (4B)       (2B)      (Variable)
```

### Voice Packet Routing (Server)

- Receive voice packets from connected clients
- Identify the source client and its current channel
- Broadcast to all connected clients in the same channel
- Drop packets from unauthenticated clients
- Handle packet loss gracefully (no retransmission needed for voice)

### Audio Decoding & Playback (Client)

- Decode multiple incoming Opus streams
- Maintain a separate decoder for each speaker
- Mix multiple streams for playback
- Maintain a jitter buffer (20-100ms)
- Handle packet loss (silence/interpolation)
- Support volume adjustment per speaker and master volume

## Performance Requirements

- **Latency:** <100ms round-trip (E2E)
- **Jitter:** <50ms acceptable variation
- **Packet Loss Tolerance:** Up to 2% without noticeable degradation
- **Memory:** <50MB for the audio subsystem (including buffers and decoders)
- **CPU:** <5% for a single audio stream on a modern dual-core CPU

## Data Flow

### Publishing Voice Stream

```
User → Microphone → Audio Capture (Device)
    ↓
Audio Processing (gain, echo cancellation)
    ↓
Opus Encoder (20ms frames)
    ↓
RTP-like Packets with metadata
    ↓
gRPC Streaming to Server
```

### Receiving Voice Stream

```
Server broadcasts packet to all channel members
    ↓
Client receives on audio stream listener
    ↓
Opus Decoder (separate per speaker)
    ↓
Audio Mix Engine (combine multiple speakers)
    ↓
Audio Playback Device
    ↓
Speaker Output
```

## Error Handling

- Lost packets: silence substitution or previous-frame interpolation
- Decoder errors: skip corrupted packets, log error
- Device unavailable: graceful fallback, user notification
- Network interruption: auto-reconnect voice stream
- Buffer overflow: drop oldest frames, log warning

## Configuration

- Audio device selection (OS-dependent enumeration)
- Microphone volume level
- Speaker volume level
- Bitrate preference
- Enable/disable voice activity detection
- Enable/disable echo cancellation

## Dependencies

- Opus codec library (libopus bindings; gopxl/beep covers playback and mixing, not Opus coding)
- Audio device access (PortAudio or OS-specific APIs)
- RTP/gRPC for packet transport

## Testing Strategy

- Unit tests for Opus encoding/decoding
- Network simulation tests for packet loss
- Integration tests with mock audio devices
- Latency measurement benchmarks
- Jitter buffer tests with varying packet arrival times
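
As a possible starting point for packet-level unit tests, the layout in the Packet Format section (2-byte sequence number, 4-byte timestamp, 4-byte SSRC, 2-byte payload length, then Opus data) can be sketched in Go as below. The type and function names (`VoicePacket`, `Marshal`, `Unmarshal`) are hypothetical, and big-endian byte order is an assumption, since the spec does not state one.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// headerSize is 2 (SeqNum) + 4 (Timestamp) + 4 (SSRC) + 2 (Payload Length).
const headerSize = 12

// VoicePacket mirrors the fields in the Packet Format section.
// Big-endian byte order is an assumption; the spec does not specify it.
type VoicePacket struct {
	SeqNum    uint16
	Timestamp uint32
	SSRC      uint32
	Payload   []byte // Opus data
}

// Marshal serializes the packet as a 12-byte header followed by the payload.
func (p VoicePacket) Marshal() ([]byte, error) {
	if len(p.Payload) > 0xFFFF {
		return nil, errors.New("payload exceeds 2-byte length field")
	}
	buf := make([]byte, headerSize+len(p.Payload))
	binary.BigEndian.PutUint16(buf[0:2], p.SeqNum)
	binary.BigEndian.PutUint32(buf[2:6], p.Timestamp)
	binary.BigEndian.PutUint32(buf[6:10], p.SSRC)
	binary.BigEndian.PutUint16(buf[10:12], uint16(len(p.Payload)))
	copy(buf[headerSize:], p.Payload)
	return buf, nil
}

// Unmarshal parses a packet and rejects truncated input.
func Unmarshal(buf []byte) (VoicePacket, error) {
	if len(buf) < headerSize {
		return VoicePacket{}, errors.New("packet shorter than header")
	}
	n := int(binary.BigEndian.Uint16(buf[10:12]))
	if len(buf) < headerSize+n {
		return VoicePacket{}, errors.New("payload truncated")
	}
	return VoicePacket{
		SeqNum:    binary.BigEndian.Uint16(buf[0:2]),
		Timestamp: binary.BigEndian.Uint32(buf[2:6]),
		SSRC:      binary.BigEndian.Uint32(buf[6:10]),
		Payload:   append([]byte(nil), buf[headerSize:headerSize+n]...),
	}, nil
}

func main() {
	pkt := VoicePacket{SeqNum: 7, Timestamp: 960, SSRC: 42, Payload: []byte{1, 2, 3}}
	raw, _ := pkt.Marshal()
	back, err := Unmarshal(raw)
	fmt.Println(len(raw), back.SeqNum, back.Timestamp, back.SSRC, back.Payload, err)
}
```

Copying the payload in `Unmarshal` keeps the parsed packet independent of the receive buffer, which matters if the transport reuses that buffer for the next read.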
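
For the jitter buffer tests, a simplified reordering buffer might look like the following sketch. It is not the planned implementation: `JitterBuffer`, `Push`, and `Pop` are hypothetical names, the depth is fixed rather than adaptive, and the "drop oldest frames" overflow policy from the Error Handling section is left out. With 20ms frames, a depth of 1 to 5 frames corresponds to the 20-100ms range in the requirements.

```go
package main

import "fmt"

// JitterBuffer reorders frames by sequence number before playback.
// Simplified sketch: fixed depth, no adaptive resizing, no overflow cap.
type JitterBuffer struct {
	frames  map[uint16][]byte
	next    uint16 // next sequence number due for playback
	primed  bool   // true once the first frame has arrived
	playing bool   // true once playback has started
	depth   int    // frames to hold before playback starts
}

func NewJitterBuffer(depth int) *JitterBuffer {
	return &JitterBuffer{frames: make(map[uint16][]byte), depth: depth}
}

// Push stores a frame. The int16 cast makes the comparison safe across
// uint16 sequence-number wraparound.
func (j *JitterBuffer) Push(seq uint16, frame []byte) {
	switch {
	case !j.primed:
		j.next, j.primed = seq, true
	case int16(seq-j.next) < 0:
		if j.playing {
			return // arrived after its playback slot; drop it
		}
		j.next = seq // still filling: an earlier frame moves the start back
	}
	j.frames[seq] = frame
}

// Pop returns the next frame in order. ok is false while the buffer is
// still filling; a nil frame with ok == true marks a lost packet, which
// the caller can conceal with silence or interpolation.
func (j *JitterBuffer) Pop() (frame []byte, ok bool) {
	if !j.playing {
		if len(j.frames) < j.depth {
			return nil, false
		}
		j.playing = true
	}
	frame = j.frames[j.next]
	delete(j.frames, j.next)
	j.next++
	return frame, true
}

func main() {
	jb := NewJitterBuffer(2)
	jb.Push(11, []byte("b")) // arrives out of order
	jb.Push(10, []byte("a"))
	for i := 0; i < 3; i++ {
		f, ok := jb.Pop()
		fmt.Printf("%q %v\n", f, ok)
	}
}
```

A test with varying arrival times can feed `Push` in shuffled order and assert that `Pop` yields frames in sequence, with a nil frame wherever a packet was withheld.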
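
The mixing step in the Audio Decoding & Playback requirements (per-speaker volume, master volume, multiple streams combined) can be illustrated with a minimal sketch over decoded 16-bit PCM frames. The `Stream` type and `Mix` function are hypothetical, and hard clamping is only one option; a production mixer might use a limiter instead.

```go
package main

import "fmt"

// Stream is one decoded speaker frame plus its per-speaker volume.
type Stream struct {
	PCM    []int16 // 960 samples for a 20ms mono frame at 48 kHz
	Volume float64 // 1.0 leaves the signal unchanged
}

// Mix sums the streams sample by sample, applies the master volume, and
// clamps the result to the int16 range to avoid wraparound distortion.
func Mix(streams []Stream, master float64, frameLen int) []int16 {
	out := make([]int16, frameLen)
	for i := range out {
		var sum float64
		for _, s := range streams {
			if i < len(s.PCM) { // a short or missing frame contributes silence
				sum += float64(s.PCM[i]) * s.Volume
			}
		}
		sum *= master
		switch {
		case sum > 32767:
			sum = 32767
		case sum < -32768:
			sum = -32768
		}
		out[i] = int16(sum)
	}
	return out
}

func main() {
	a := Stream{PCM: []int16{30000, -30000, 100}, Volume: 1.0}
	b := Stream{PCM: []int16{10000, -10000, 50}, Volume: 0.5}
	fmt.Println(Mix([]Stream{a, b}, 1.0, 4))
}
```

Summing in floating point before clamping means two loud speakers saturate at full scale rather than wrapping to the opposite sign, which would be audible as a click.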