OpenSpeak/openspec/changes/add-voice-communication/voice.md

# Spec Delta: Voice Communication

**Change ID:** `add-voice-communication`
**Capability:** Voice Communication
**Type:** NEW

## ADDED Requirements

### Audio Capture & Encoding

#### Requirement: Client shall capture audio from selected microphone device

**Description:** The client application shall record audio from the user's selected microphone device at a 48kHz sample rate with 16-bit depth in mono format, processing audio in 20ms frames (960 samples per frame).

**Priority:** Critical
**Status:** Proposed

**Details:**
- Sample rate: 48kHz (Opus fullband rate)
- Bit depth: 16-bit PCM
- Channels: Mono (future: stereo support)
- Frame duration: 20ms (960 samples)
- Device selection: User configurable in settings
- Fallback: Default device if the selected device is unavailable
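
The frame figures above follow directly from the sample rate and frame duration. A minimal sketch of that arithmetic, assuming raw little-endian PCM held in a `bytes` object; the function names are hypothetical and not part of this spec:

```python
from typing import List

SAMPLE_RATE_HZ = 48_000    # from the spec
FRAME_DURATION_MS = 20     # from the spec
BYTES_PER_SAMPLE = 2       # 16-bit PCM
CHANNELS = 1               # mono


def samples_per_frame() -> int:
    """Samples per channel in one frame: 48000 * 20 / 1000 = 960."""
    return SAMPLE_RATE_HZ * FRAME_DURATION_MS // 1000


def bytes_per_frame() -> int:
    """Raw PCM size of one frame: 960 samples * 2 bytes * 1 channel."""
    return samples_per_frame() * BYTES_PER_SAMPLE * CHANNELS


def split_into_frames(pcm: bytes) -> List[bytes]:
    """Split captured PCM into whole frames; a trailing partial frame is left out."""
    size = bytes_per_frame()
    whole = len(pcm) - len(pcm) % size
    return [pcm[i:i + size] for i in range(0, whole, size)]
```

In a real capture loop the trailing partial frame would be carried over to the next read rather than discarded; that bookkeeping is omitted here.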

**Scenarios:**

#### Scenario: User selects microphone and speaks
```
Given: Client is connected to server
When: User selects microphone from audio settings
And: User unmutes microphone
And: User speaks into microphone
Then: Audio is captured at 48kHz 16-bit mono
And: Frames processed every 20ms
And: Captured audio ready for encoding
```

#### Scenario: Selected device becomes unavailable
```
Given: User had selected specific microphone
When: That microphone is disconnected
Then: Client falls back to default device
And: User is notified of device change
And: Audio capture continues without interruption
```
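
The fallback rule in this scenario can be sketched as a small selection function. This is an illustration only: `choose_capture_device` and its string device identifiers are assumptions, and a real client would query the audio library for the device list.

```python
from typing import Optional, Sequence, Tuple


def choose_capture_device(
    selected: Optional[str],
    available: Sequence[str],
    default: str,
) -> Tuple[str, bool]:
    """Return (device_to_use, fell_back).

    fell_back is True only when the user had selected a device that is
    no longer available, which is the case where the user is notified.
    """
    if selected is not None and selected in available:
        return selected, False
    return default, selected is not None
```

For example, `choose_capture_device("USB Mic", ["Built-in"], "Built-in")` yields the default device together with a flag that the caller can use to show the device-change notification.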

### Opus Encoding

#### Requirement: Client shall encode captured audio with Opus codec

**Description:** Client shall encode 20ms audio frames using Opus codec at configurable bitrate (default 64kbps, range 8-128kbps) with variable bitrate enabled.

**Priority:** Critical
**Status:** Proposed

**Details:**
- Codec: Opus
- Bitrate: 64kbps default (configurable)
- Bitrate range: 8-128kbps
- Variable bitrate: Enabled
- Encoding latency: <20ms per frame
- Output: Encoded packets ready for transmission
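
To make the bitrate figures concrete, the sketch below clamps a requested bitrate to the allowed range and estimates the average encoded payload per 20ms frame. The helper names are hypothetical; with variable bitrate enabled, individual packets will be larger or smaller than this average.

```python
MIN_BITRATE_BPS = 8_000       # lower bound from the spec
MAX_BITRATE_BPS = 128_000     # upper bound from the spec
DEFAULT_BITRATE_BPS = 64_000  # default from the spec
FRAME_DURATION_MS = 20


def clamp_bitrate(requested_bps: int) -> int:
    """Keep a user-requested bitrate inside the 8-128kbps range."""
    return max(MIN_BITRATE_BPS, min(MAX_BITRATE_BPS, requested_bps))


def average_payload_bytes(bitrate_bps: int, frame_ms: int = FRAME_DURATION_MS) -> int:
    """Average encoded bytes per frame: bits per second * seconds per frame / 8."""
    return bitrate_bps * frame_ms // (8 * 1000)
```

At the 64kbps default this works out to about 160 bytes per frame, and at 32kbps to about 80 bytes, before any transport or metadata overhead.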

**Scenarios:**

#### Scenario: Client encodes audio frame
```
Given: 20ms of audio captured from microphone
When: Client processes the audio frame
Then: Frame is encoded with Opus at configured bitrate
And: Encoded payload is ready for transmission
And: Encoding latency is <20ms
And: Encoding quality matches bitrate setting
```

#### Scenario: User changes bitrate preference
```
Given: Client is capturing and encoding audio
When: User changes bitrate setting from 64kbps to 32kbps
Then: Subsequent frames encoded at 32kbps
And: Audio quality decreases but bandwidth usage is reduced
And: Change takes effect within 1 second
```

### Voice Packet Transmission

#### Requirement: Client shall transmit encoded voice packets to server

**Description:** Client shall send Opus-encoded voice packets to server via gRPC streaming connection, including metadata (sequence number, timestamp, channel ID).

**Priority:** Critical
**Status:** Proposed

**Scenarios:**

#### Scenario: Client sends voice packet to server
```
Given: Audio is encoded with Opus
When: Client has active connection to server
And: User is in a voice channel
Then: Encoded packet sent to server immediately
And: Packet includes sequence number, timestamp, and channel ID
And: Server receives packet within typical network latency
And: Transmission continues at 20ms intervals per audio frame
```
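
The packet metadata named in this requirement could be modelled as follows. This is a sketch under stated assumptions: the spec does not define field types or the timestamp unit, so `VoicePacket`, `Packetizer`, the string channel ID, and the sample-offset timestamp are all hypothetical. Over gRPC the same fields would normally live in a protobuf message rather than a Python dataclass.

```python
from dataclasses import dataclass

SAMPLES_PER_FRAME = 960  # 20ms at 48kHz


@dataclass(frozen=True)
class VoicePacket:
    sequence: int     # increments by 1 per frame
    timestamp: int    # assumed unit: sample offset since the stream started
    channel_id: str   # assumed to be a string identifier
    payload: bytes    # Opus-encoded frame


class Packetizer:
    """Attach sequence number, timestamp, and channel ID to encoded frames."""

    def __init__(self, channel_id: str) -> None:
        self.channel_id = channel_id
        self._next_sequence = 0

    def wrap(self, payload: bytes) -> VoicePacket:
        sequence = self._next_sequence
        self._next_sequence += 1
        return VoicePacket(
            sequence=sequence,
            timestamp=sequence * SAMPLES_PER_FRAME,
            channel_id=self.channel_id,
            payload=payload,
        )
```

A monotonically increasing sequence number is what lets the receiver detect gaps, such as the one expected after a reconnect.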

#### Scenario: Client disconnects mid-speech
```
Given: Client is sending voice packets
When: Network connection is lost
Then: Voice packet transmission stops
And: Local audio capture continues (buffered)
And: Client attempts to reconnect
And: Transmission resumes once reconnected (with a possible gap)
```

### Server Voice Routing

#### Requirement: Server shall route voice packets to channel members

**Description:** Server shall receive voice packets from the publishing client, validate that the source is authenticated and in the channel, and broadcast each packet to all other connected members of the same channel.

**Priority:** Critical
**Status:** Proposed

**Scenarios:**

#### Scenario: Server broadcasts voice packet to channel
```
Given: Server receives voice packet from Client A
And: Client A is authenticated
And: Client A is in "general" channel
When: Packet is validated
Then: Packet is broadcast to all other members of "general" channel
And: Each member receives packet within 50ms of reception
And: Packet is not sent back to originating client
And: Clients not in the channel do not receive the packet
```

#### Scenario: Unauthenticated client sends voice packet
```
Given: A client sends voice packet without valid token
When: Server receives the packet
Then: Packet is dropped
And: Client connection is terminated
And: Error is logged for audit
```
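
The validation and broadcast rules in the two scenarios above can be sketched as an in-memory router. `VoiceRouter` and its methods are hypothetical names; token verification, connection termination, and audit logging are left to the caller and are not shown.

```python
from typing import Dict, List, Optional, Set


class VoiceRouter:
    """Decide which clients should receive a packet from a given sender."""

    def __init__(self) -> None:
        self._authenticated: Set[str] = set()
        self._members: Dict[str, Set[str]] = {}  # channel -> client IDs

    def authenticate(self, client_id: str) -> None:
        self._authenticated.add(client_id)

    def join(self, channel: str, client_id: str) -> None:
        self._members.setdefault(channel, set()).add(client_id)

    def recipients(self, sender_id: str, channel: str) -> Optional[List[str]]:
        """Return recipient IDs, or None if the packet must be dropped."""
        if sender_id not in self._authenticated:
            return None  # caller drops the packet, closes the connection, logs
        members = self._members.get(channel, set())
        if sender_id not in members:
            return None  # sender is not in the channel it is publishing to
        # Everyone in the channel except the originating client.
        return sorted(members - {sender_id})
```

Returning `None` instead of an empty list keeps "rejected" distinct from "valid packet, but nobody else is in the channel".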

#### Scenario: Server handles many concurrent speakers
```
Given: 5 clients are in same channel
When: All 5 clients speak simultaneously
Then: Server receives packets from all 5 sources
And: Each source's packets are routed to the other 4 clients
And: Routing latency <100ms for all packets
And: No packets are dropped due to volume
```
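
The fan-out load in this scenario is simple to estimate: each speaker's packet goes to every other channel member, once per 20ms frame. The helper below is a hypothetical back-of-the-envelope calculation, assuming every speaker is also a channel member and no frames are suppressed.

```python
def outbound_packets_per_second(speakers: int, members: int, frame_ms: int = 20) -> int:
    """Packets the server must send per second for one channel."""
    frames_per_second = 1000 // frame_ms      # 50 frames/s at 20ms
    per_interval = speakers * (members - 1)   # each packet skips its sender
    return per_interval * frames_per_second
```

With 5 members all speaking at once, the server forwards 20 packets per frame interval, or 1000 packets per second for that channel.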

### Audio Decoding & Playback

#### Requirement: Client shall decode received voice packets and play audio

**Description:** Client shall receive Opus-encoded voice packets from the server for each speaker in the channel, decode each speaker's stream independently, mix the decoded streams, and output the result to the selected speaker device.

**Priority:** Critical
**Status:** Proposed

**Details:**
- Decode: Opus decoder per speaker
- Mixing: Multiple streams combined for playback
- Playback: Output to selected speaker device
- Volume control: Per-speaker and master volume
- Latency: End-to-end <100ms

**Scenarios:**

#### Scenario: Client receives and plays voice packet
```
Given: Server sends voice packet from Speaker A
When: Client receives packet from channel
Then: Packet is queued in receive buffer
And: Opus decoder decodes packet
And: Audio sample is mixed with other speakers
And: Mixed audio played through speaker device
And: User hears Speaker A clearly
```

#### Scenario: Multiple speakers simultaneously
```
Given: Client in channel with 3 other speakers
When: All 3 speakers transmit simultaneously
Then: Client receives packets from all 3 sources
And: 3 independent Opus decoders active
And: All 3 streams mixed together
And: User hears all 3 speakers blended
And: Volume of each controllable separately
```
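
Mixing with per-speaker and master volume can be sketched as a sample-wise weighted sum, clipped to the 16-bit range. This is an illustrative sketch, not a prescribed algorithm: `mix_frames` is a hypothetical name, frames are assumed to be already decoded to lists of 16-bit integers of equal length, and hard clipping is the simplest of several possible overflow strategies.

```python
from typing import List, Mapping, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_frames(
    frames: Mapping[str, Sequence[int]],
    volumes: Mapping[str, float],
    master: float = 1.0,
) -> List[int]:
    """Mix one decoded frame per speaker into a single output frame."""
    if not frames:
        return []
    length = min(len(frame) for frame in frames.values())
    mixed: List[int] = []
    for i in range(length):
        # Speakers without an explicit volume play at unity gain.
        total = sum(
            frame[i] * volumes.get(speaker, 1.0)
            for speaker, frame in frames.items()
        )
        sample = int(round(total * master))
        mixed.append(max(INT16_MIN, min(INT16_MAX, sample)))
    return mixed
```

For example, mixing `{"a": [1000, -1000], "b": [500, 500]}` with speaker `b` at volume 0.5 gives `[1250, -750]`.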

#### Scenario: Handle packet loss gracefully
```
Given: Packet loss occurs in network
When: Expected voice packet does not arrive
Then: Jitter buffer detects missing packet
And: Client uses interpolation or silence substitution
And: Playback continues without stopping
And: User notices minor quality drop but no complete loss
```
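
The loss-handling behaviour above can be sketched as a sequence-ordered buffer that substitutes silence for a missing frame. This is a deliberately simplified illustration: `JitterBuffer` is a hypothetical class, it stores already-decoded PCM frames, and it uses silence substitution only. A production client would more likely buffer encoded packets and let the Opus decoder's packet loss concealment fill the gap.

```python
from typing import Dict, Tuple

SILENCE_FRAME = bytes(1920)  # 960 samples of 16-bit mono silence


class JitterBuffer:
    """Release frames in sequence order, filling gaps with silence."""

    def __init__(self) -> None:
        self._pending: Dict[int, bytes] = {}
        self._next_sequence = 0

    def push(self, sequence: int, pcm: bytes) -> None:
        # Frames older than the playback position arrived too late to use.
        if sequence >= self._next_sequence:
            self._pending[sequence] = pcm

    def pop(self) -> Tuple[bytes, bool]:
        """Return (frame, was_lost) for the next playback slot."""
        frame = self._pending.pop(self._next_sequence, None)
        self._next_sequence += 1
        if frame is None:
            return SILENCE_FRAME, True
        return frame, False
```

Calling `pop` once per 20ms playback tick keeps output continuous even when a packet never arrives.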

### Latency Requirements

#### Requirement: Voice communication shall maintain <100ms end-to-end latency

**Description:** End-to-end latency from microphone input to speaker output shall not exceed 100ms in typical network conditions. This is critical for real-time conversational quality.

**Priority:** Critical
**Status:** Proposed

**Scenarios:**

#### Scenario: Measure end-to-end latency
```
Given: Client A and Client B in same channel
When: Client A captures audio
And: Transmits to server
And: Server broadcasts to Client B
And: Client B decodes and plays
Then: Total latency is <100ms in 95% of measurements
And: Average latency is <80ms
And: No latency spike exceeds 200ms
```
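
The three thresholds in this scenario can be checked mechanically against a list of measured latencies. A minimal sketch, assuming measurements are collected in milliseconds; `meets_latency_targets` is a hypothetical helper name.

```python
from statistics import mean
from typing import Sequence


def meets_latency_targets(latencies_ms: Sequence[float]) -> bool:
    """Check the <100ms (95%), <80ms (average), and 200ms (max) targets."""
    if not latencies_ms:
        return False
    under_100 = sum(1 for value in latencies_ms if value < 100)
    # Integer comparison avoids floating-point rounding at exactly 95%.
    enough_under_100 = under_100 * 100 >= 95 * len(latencies_ms)
    average_ok = mean(latencies_ms) < 80
    no_spike = max(latencies_ms) <= 200
    return enough_under_100 and average_ok and no_spike
```

A run of 95 samples at 60ms and 5 samples at 150ms passes, while a single 250ms sample fails the spike limit regardless of the other values.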

### Voice Activity Detection (Optional)

#### Requirement: Client shall optionally detect voice activity to reduce bandwidth

**Description:** When enabled, voice activity detection (VAD) shall detect silence/absence of speech and suppress transmission of silent frames to reduce bandwidth usage.

**Priority:** Medium
**Status:** Proposed

**Details:**
- VAD: Optional, disabled by default for MVP
- Silence threshold: Configurable
- Bandwidth savings: ~50% reduction when speaking 50% of time
- False positive rate: <5% (silence detected as speech)

**Scenarios:**

#### Scenario: VAD enabled reduces bandwidth
```
Given: User enables voice activity detection
When: User speaks for 30 seconds then pauses for 30 seconds
Then: Bandwidth used only during speaking portions
And: Pause/silence frames not transmitted
And: Total bandwidth ~50% of always-on scenario
And: Transmission resumes immediately when the user starts speaking again
```
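
One simple way to implement a configurable silence threshold is an RMS energy gate, sketched below. This is an assumption for illustration, not the method this spec mandates: the function names and the default threshold of 500 are hypothetical, and a practical detector would usually add a hangover period so that word endings are not clipped.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square level of one frame of 16-bit samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))


def is_speech(frame: Sequence[int], threshold: float = 500.0) -> bool:
    """Treat a frame as speech when its RMS level reaches the threshold."""
    return rms(frame) >= threshold


def frames_to_send(frames: Sequence[Sequence[int]], threshold: float = 500.0) -> List[Sequence[int]]:
    """Drop frames classified as silence before encoding and transmission."""
    return [frame for frame in frames if is_speech(frame, threshold)]
```

With half the frames silent, `frames_to_send` keeps roughly half of them, which matches the ~50% bandwidth estimate above.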

## DEPENDENCIES

### On Other Capabilities
- **Depends:** Authentication (tokens for voice stream auth)
- **Depends:** Channel Management (which channel to route voice to)
- **Depends:** User Presence (tracking who's speaking)
- **Depends:** Server Core (gRPC streaming infrastructure)

### On External Libraries
- Opus codec library
- Audio device library (PortAudio or OS-specific)
- gRPC streaming (already required)

## ACCEPTANCE CRITERIA

- [ ] Voice packets successfully route from source to all channel members
- [ ] End-to-end latency measured at <100ms in test scenarios
- [ ] Multiple concurrent speakers (10+) supported without packet loss
- [ ] Packet loss up to 2% handled gracefully
- [ ] CPU usage <5% per active stream on a modern dual-core CPU
- [ ] Memory usage <50MB for voice subsystem
- [ ] Unit test coverage >80%
- [ ] Integration tests pass for full voice communication flow
- [ ] Performance benchmarks documented

## TESTING STRATEGY

### Unit Tests
- Test Opus encode/decode with various bitrates
- Test voice packet structure and validation
- Test jitter buffer with varying packet timing
- Test packet loss detection and recovery

### Integration Tests
- Test voice packet flow from client to server to other clients
- Test with multiple concurrent speakers
- Test channel-scoped routing (clients in other channels do not receive packets)
- Test authentication required for voice streaming

### Performance Tests
- Benchmark Opus encoding/decoding performance
- Measure end-to-end latency with network emulation
- Stress test with 20+ concurrent speakers
- Memory profiling with sustained voice streams

### Manual Testing
- Listen to actual voice quality with different bitrates
- Test with poor network conditions (packet loss, jitter)
- Verify there are no audio artifacts or cut-off speech