# Spec Delta: Voice Communication

**Change ID:** `add-voice-communication`
**Capability:** Voice Communication
**Type:** NEW

## ADDED Requirements

### Audio Capture & Encoding

#### Requirement: Client shall capture audio from selected microphone device

**Description:** Client application shall record audio from user's selected microphone device at 48kHz sample rate with 16-bit depth in mono format, processing audio in 20ms frames (960 samples).

**Priority:** Critical
**Status:** Proposed

**Details:**
- Sample rate: 48kHz (Opus standard)
- Bit depth: 16-bit PCM
- Channels: Mono (future: stereo support)
- Frame duration: 20ms (960 samples)
- Device selection: User configurable in settings
- Fallback to default device if selected unavailable

**Scenarios:**

#### Scenario: User selects microphone and speaks
```
Given: Client is connected to server
When: User selects microphone from audio settings
And: User unmutes microphone
And: User speaks into microphone
Then: Audio is captured at 48kHz 16-bit mono
And: Frames processed every 20ms
And: Captured audio ready for encoding
```

#### Scenario: Selected device becomes unavailable
```
Given: User had selected specific microphone
When: That microphone is disconnected
Then: Client falls back to default device
And: User is notified of device change
And: Audio capture continues without interruption
```

### Opus Encoding

#### Requirement: Client shall encode captured audio with Opus codec

**Description:** Client shall encode 20ms audio frames using Opus codec at configurable bitrate (default 64kbps, range 8-128kbps) with variable bitrate enabled.

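
As a rough illustration of the numbers in this requirement, the sketch below derives the samples per frame and the average encoded payload size per 20ms frame for a given bitrate. The function names and the clamping of out-of-range bitrates are assumptions made for illustration and are not defined by this spec. With variable bitrate enabled, actual Opus packet sizes would vary around these averages.

```python
SAMPLE_RATE_HZ = 48_000        # capture sample rate from this spec
FRAME_MS = 20                  # frame duration from this spec
MIN_BITRATE_BPS = 8_000        # lower bound of the configurable range
MAX_BITRATE_BPS = 128_000      # upper bound of the configurable range
DEFAULT_BITRATE_BPS = 64_000   # default bitrate


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Number of PCM samples in one mono frame."""
    return sample_rate_hz * frame_ms // 1000


def clamp_bitrate(bitrate_bps):
    """Assumption: out-of-range settings are clamped rather than rejected."""
    return max(MIN_BITRATE_BPS, min(MAX_BITRATE_BPS, bitrate_bps))


def average_payload_bytes(bitrate_bps=DEFAULT_BITRATE_BPS, frame_ms=FRAME_MS):
    """Average encoded bytes per frame at the given target bitrate."""
    return clamp_bitrate(bitrate_bps) * frame_ms // (1000 * 8)
```

At the default 64kbps this gives an average of 160 bytes per 20ms frame, compared with 1920 bytes for the same frame as raw 16-bit mono PCM (960 samples at 2 bytes each).
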
**Priority:** Critical
**Status:** Proposed

**Details:**
- Codec: Opus
- Bitrate: 64kbps default (configurable)
- Bitrate range: 8-128kbps
- Variable bitrate: Enabled
- Encoding latency: <20ms per frame
- Output: Encoded packets ready for transmission

**Scenarios:**

#### Scenario: Client encodes audio frame
```
Given: 20ms of audio captured from microphone
When: Client processes the audio frame
Then: Frame is encoded with Opus at configured bitrate
And: Encoded payload is ready for transmission
And: Encoding latency is <20ms
And: Encoding quality matches bitrate setting
```

#### Scenario: User changes bitrate preference
```
Given: Client is capturing and encoding audio
When: User changes bitrate setting from 64kbps to 32kbps
Then: Subsequent frames encoded at 32kbps
And: Audio quality decreases but bandwidth reduced
And: Change takes effect within 1 second
```

### Voice Packet Transmission

#### Requirement: Client shall transmit encoded voice packets to server

**Description:** Client shall send Opus-encoded voice packets to server via gRPC streaming connection, including metadata (sequence number, timestamp, channel ID).

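
The metadata listed in this requirement could be modelled roughly as follows. This is only a sketch: the spec does not define the message schema, so the `VoicePacket` type, its field names, and the choice to express the timestamp as a sample count since stream start are all assumptions. A real implementation would presumably define the message in the gRPC service definition instead.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

SAMPLES_PER_FRAME = 960  # 20ms at 48kHz, per the capture requirement


@dataclass(frozen=True)
class VoicePacket:
    """Hypothetical packet shape; not defined by this spec."""
    sequence: int    # increments by one per frame
    timestamp: int   # assumption: sample count since stream start
    channel_id: str
    payload: bytes   # Opus-encoded frame


def packetize(frames: Iterable[bytes], channel_id: str) -> Iterator[VoicePacket]:
    """Attach sequence number, timestamp, and channel ID to encoded frames."""
    for sequence, payload in enumerate(frames):
        yield VoicePacket(
            sequence=sequence,
            timestamp=sequence * SAMPLES_PER_FRAME,
            channel_id=channel_id,
            payload=payload,
        )
```

A monotonically increasing sequence number lets the receiver detect gaps and reordering, while a sample-based timestamp lets it schedule playback independently of arrival time.
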
**Priority:** Critical
**Status:** Proposed

**Scenarios:**

#### Scenario: Client sends voice packet to server
```
Given: Audio is encoded with Opus
When: Client has active connection to server
And: User is in a voice channel
Then: Encoded packet sent to server immediately
And: Packet includes sequence number, timestamp
And: Server receives packet within typical network latency
And: Transmission continues at 20ms intervals per audio frame
```

#### Scenario: Client disconnects mid-speech
```
Given: Client is sending voice packets
When: Network connection is lost
Then: Voice packet transmission stops
And: Local audio capture continues (buffered)
And: Client attempts to reconnect
And: Resume transmission when reconnected (with possible gap)
```

### Server Voice Routing

#### Requirement: Server shall route voice packets to channel members

**Description:** Server shall receive voice packets from publishing client, validate source is authenticated and in channel, and broadcast packet to all other connected members of the same channel.

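
The validate-then-broadcast rule described here can be sketched as below. The `VoiceRouter` class and its method names are hypothetical, and the sketch only computes the recipient list: actual delivery over gRPC streams, connection termination, and audit logging for rejected senders are left out.

```python
from collections import defaultdict


class VoiceRouter:
    """Hypothetical in-memory router; names are illustrative only."""

    def __init__(self):
        self._authenticated = set()        # client IDs with a valid token
        self._members = defaultdict(set)   # channel ID -> client IDs

    def authenticate(self, client_id):
        self._authenticated.add(client_id)

    def join(self, client_id, channel_id):
        self._members[channel_id].add(client_id)

    def recipients(self, sender_id, channel_id):
        """Return who should receive a packet, or [] if it must be dropped."""
        members = self._members.get(channel_id, set())
        if sender_id not in self._authenticated or sender_id not in members:
            return []  # drop: sender is unauthenticated or not in the channel
        # Broadcast to every other member, never back to the sender.
        return sorted(members - {sender_id})
```

For example, with clients `a` and `b` in `general` and client `c` in another channel, a packet from `a` to `general` is delivered only to `b`.
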
**Priority:** Critical
**Status:** Proposed

**Scenarios:**

#### Scenario: Server broadcasts voice packet to channel
```
Given: Server receives voice packet from Client A
And: Client A is authenticated
And: Client A is in "general" channel
When: Packet is validated
Then: Packet is broadcast to all other members of "general" channel
And: Each member receives packet within 50ms of reception
And: Packet is not sent back to originating client
And: Other members not in channel do not receive packet
```

#### Scenario: Unauthenticated client sends voice packet
```
Given: A client sends voice packet without valid token
When: Server receives the packet
Then: Packet is dropped
And: Client connection is terminated
And: Error is logged for audit
```

#### Scenario: Server handles many concurrent speakers
```
Given: 5 clients are in same channel
When: All 5 clients speak simultaneously
Then: Server receives packets from all 5 sources
And: Packets routed to all other 4 clients per source
And: Routing latency <100ms for all packets
And: No packets are dropped due to volume
```

### Audio Decoding & Playback

#### Requirement: Client shall decode received voice packets and play audio

**Description:** Client shall receive Opus-encoded voice packets from server for each speaker in channel, decode independently, mix multiple streams, and output to speaker device.

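
The mixing step mentioned in this requirement might look like the following sketch. It operates on already-decoded 16-bit PCM frames (the Opus decode itself is omitted) and assumes all frames have the same length. The function name, the gain parameters, and the use of hard saturation to avoid integer overflow are assumptions; the spec does not prescribe a mixing algorithm.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def mix_frames(frames, gains=None, master_gain=1.0):
    """Sum per-speaker PCM frames sample by sample, with saturation.

    frames: list of equal-length lists of int16 samples, one per speaker.
    gains:  optional per-speaker volume multipliers (default 1.0 each).
    """
    if not frames:
        return []
    if gains is None:
        gains = [1.0] * len(frames)
    mixed = []
    for i in range(len(frames[0])):
        total = sum(frame[i] * gain for frame, gain in zip(frames, gains))
        total *= master_gain
        # Clamp to the 16-bit range instead of letting the sum wrap around.
        mixed.append(max(INT16_MIN, min(INT16_MAX, int(round(total)))))
    return mixed
```

Keeping a separate gain per speaker in the mixer is one simple way to support both per-speaker and master volume control.
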
**Priority:** Critical
**Status:** Proposed

**Details:**
- Decode: Opus decoder per speaker
- Mixing: Multiple streams combined for playback
- Playback: Output to selected speaker device
- Volume control: Per-speaker and master volume
- Latency: End-to-end <100ms

**Scenarios:**

#### Scenario: Client receives and plays voice packet
```
Given: Server sends voice packet from Speaker A
When: Client receives packet from channel
Then: Packet is queued in receive buffer
And: Opus decoder decodes packet
And: Audio sample is mixed with other speakers
And: Mixed audio played through speaker device
And: User hears Speaker A clearly
```

#### Scenario: Multiple speakers simultaneously
```
Given: Client in channel with 3 other speakers
When: All 3 speakers transmit simultaneously
Then: Client receives packets from all 3 sources
And: 3 independent Opus decoders active
And: All 3 streams mixed together
And: User hears all 3 speakers blended
And: Volume of each controllable separately
```

#### Scenario: Handle packet loss gracefully
```
Given: Packet loss occurs in network
When: Expected voice packet does not arrive
Then: Jitter buffer detects missing packet
And: Client uses interpolation or silence substitution
And: Playback continues without stopping
And: User notices minor quality drop but no complete loss
```

### Latency Requirements

#### Requirement: Voice communication shall maintain <100ms round-trip latency

**Description:** End-to-end latency from microphone input to speaker output shall not exceed 100ms in typical network conditions. This is critical for real-time conversational quality.

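
One way to check this target against a set of measurements is sketched below. The thresholds are taken from the measurement scenario in this section (under 100ms in 95% of measurements, average under 80ms, no spike above 200ms); the function name and the decision to treat an empty sample set as a failure are assumptions.

```python
def meets_latency_targets(samples_ms):
    """Check measured end-to-end latencies (in ms) against the spec targets."""
    if not samples_ms:
        return False  # assumption: no measurements means the check fails
    count = len(samples_ms)
    under_100 = sum(1 for s in samples_ms if s < 100)
    average = sum(samples_ms) / count
    return (
        under_100 * 100 >= 95 * count   # <100ms in at least 95% of samples
        and average < 80                # average latency <80ms
        and max(samples_ms) <= 200      # no spike exceeds 200ms
    )
```

The percentile check uses integer arithmetic so that a result of exactly 95% is not affected by floating-point rounding.
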
**Priority:** Critical
**Status:** Proposed

**Scenarios:**

#### Scenario: Measure round-trip latency
```
Given: Client A and Client B in same channel
When: Client A captures audio
And: Transmits to server
And: Server broadcasts to Client B
And: Client B decodes and plays
Then: Total latency is <100ms in 95% of measurements
And: Average latency is <80ms
And: No latency spike exceeds 200ms
```

### Voice Activity Detection (Optional)

#### Requirement: Client shall optionally detect voice activity to reduce bandwidth

**Description:** When enabled, voice activity detection (VAD) shall detect silence/absence of speech and suppress transmission of silent frames to reduce bandwidth usage.

**Priority:** Medium
**Status:** Proposed

**Details:**
- VAD: Optional, disabled by default for MVP
- Silence threshold: Configurable
- Bandwidth savings: ~50% reduction when speaking 50% of time
- False positive rate: <5% (silence detected as speech)

**Scenarios:**

#### Scenario: VAD enabled reduces bandwidth
```
Given: User enables voice activity detection
When: User speaks for 30 seconds then pauses for 30 seconds
Then: Bandwidth used only during speaking portions
And: Pause/silence frames not transmitted
And: Total bandwidth ~50% of always-on scenario
And: User hears pause when speaking resumes (immediate)
```

## DEPENDENCIES

### On Other Capabilities
- **Depends:** Authentication (tokens for voice stream auth)
- **Depends:** Channel Management (which channel to route voice to)
- **Depends:** User Presence (tracking who's speaking)
- **Depends:** Server Core (gRPC streaming infrastructure)

### On External Libraries
- Opus codec library
- Audio device library (PortAudio or OS-specific)
- gRPC streaming (already required)

## ACCEPTANCE CRITERIA

- [ ] Voice packets successfully route from source to all channel members
- [ ] Latency measured <100ms round-trip in test scenarios
- [ ] Multiple concurrent speakers (10+) supported without packet loss
- [ ] Packet loss up to 2% handled gracefully
- [ ] CPU usage <5% per active stream on modern dual-core
- [ ] Memory usage <50MB for voice subsystem
- [ ] Unit test coverage >80%
- [ ] Integration tests pass for full voice communication flow
- [ ] Performance benchmarks documented

## TESTING STRATEGY

### Unit Tests
- Test Opus encode/decode with various bitrates
- Test voice packet structure and validation
- Test jitter buffer with varying packet timing
- Test packet loss detection and recovery

### Integration Tests
- Test voice packet flow from client to server to other clients
- Test with multiple concurrent speakers
- Test channel-scoped routing (wrong channel doesn't receive)
- Test authentication required for voice streaming

### Performance Tests
- Benchmark Opus encoding/decoding performance
- Measure round-trip latency with network emulation
- Stress test with 20+ concurrent speakers
- Memory profiling with sustained voice streams

### Manual Testing
- Listen to actual voice quality with different bitrates
- Test with poor network conditions (packet loss, jitter)
- Verify no audio artifacts or cutting off