OpenSpeak/openspec/specs/001-audio-streaming.md

# Feature Specification: Audio Streaming System

**ID:** AUDIO-001
**Version:** 1.0
**Status:** Planned
**Priority:** Critical

## Overview
The audio streaming system handles real-time voice packet capture, encoding, transmission, and playback between clients and server.

## Architecture

### Client-Side Audio Pipeline
```
Send:    Microphone Input → Audio Capture → Opus Encoder → Packet Formation → Network Transmission
Receive: Network Reception → Packet Reception → Audio Decoder → Speaker Output
```

### Server-Side Audio Pipeline
```
Client 1 Voice Packets → Voice Packet Router → Broadcast to Channel Members
Client 2 Voice Packets → Voice Packet Router → [Client 1, Client 3, ...]
Client 3 Voice Packets → Voice Packet Router
```

## Requirements

### Audio Capture (Client)
- **Sample Rate:** 48 kHz (Opus standard)
- **Bit Depth:** 16-bit PCM
- **Frame Size:** 20ms frames (960 samples per channel at 48 kHz)
- **Channels:** Mono or Stereo (initially mono)
- **VAD (Voice Activity Detection):** Optional, reduces bandwidth when silent
- Support multiple audio devices (fallback to default device)
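
The frame-size figure follows directly from the sample rate and frame duration. A minimal sketch of that arithmetic is shown here; the constant and function names are illustrative and not part of any existing OpenSpeak package.

```go
package main

import "fmt"

// Capture parameters taken from the requirements above.
const (
	SampleRateHz   = 48000 // 48 kHz
	FrameMillis    = 20    // 20 ms frames
	BytesPerSample = 2     // 16-bit PCM
)

// SamplesPerFrame returns the number of samples per channel in one frame.
func SamplesPerFrame(sampleRateHz, frameMillis int) int {
	return sampleRateHz * frameMillis / 1000
}

// FrameBytes returns the size in bytes of one interleaved PCM frame.
func FrameBytes(sampleRateHz, frameMillis, channels int) int {
	return SamplesPerFrame(sampleRateHz, frameMillis) * channels * BytesPerSample
}

func main() {
	fmt.Println(SamplesPerFrame(SampleRateHz, FrameMillis)) // 960
	fmt.Println(FrameBytes(SampleRateHz, FrameMillis, 1))   // 1920 (mono)
	fmt.Println(FrameBytes(SampleRateHz, FrameMillis, 2))   // 3840 (stereo)
}
```

Sizing the capture buffer from these helpers keeps the mono and stereo cases consistent if stereo is enabled later.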

### Audio Encoding (Client)
- **Codec:** Opus with variable bitrate
- **Bitrate Range:** 8-128 kbps (configurable)
- **Default Bitrate:** 64 kbps
- **Latency:** <20ms encoder processing time per frame (excludes the 20ms frame duration itself)
- Frame-based encoding (process 20ms chunks)
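
To make the bitrate settings concrete, the sketch below clamps a requested bitrate to the configured range and estimates the average encoded payload per 20ms frame. The names are hypothetical, treating `0` as "use the default" is an assumption, and because the encoder runs in variable-bitrate mode the actual size of any single frame will differ from this average.

```go
package main

import "fmt"

// Bitrate limits from the requirements above, in bits per second.
const (
	MinBitrateBps     = 8_000
	MaxBitrateBps     = 128_000
	DefaultBitrateBps = 64_000
	FrameMillis       = 20
)

// ClampBitrate maps a requested bitrate into the supported range.
// A value of 0 is treated as "not configured" (an assumption of this sketch).
func ClampBitrate(bps int) int {
	switch {
	case bps == 0:
		return DefaultBitrateBps
	case bps < MinBitrateBps:
		return MinBitrateBps
	case bps > MaxBitrateBps:
		return MaxBitrateBps
	}
	return bps
}

// AvgPayloadBytes estimates the mean encoded size of one frame at the given bitrate.
func AvgPayloadBytes(bps int) int {
	return bps * FrameMillis / 1000 / 8
}

func main() {
	for _, requested := range []int{0, 4_000, 64_000, 256_000} {
		bps := ClampBitrate(requested)
		fmt.Printf("requested=%d effective=%d avgPayload=%dB\n", requested, bps, AvgPayloadBytes(bps))
	}
}
```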

### Packet Format
```
[Header] [Payload]
  ↓         ↓
[SeqNum][Timestamp][SSRC][Payload Length][Opus Data]
 (2B)      (4B)    (4B)      (2B)       (Variable)
```
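
The layout above gives a fixed 12-byte header followed by the Opus payload. The sketch below serializes and parses it with the standard `encoding/binary` package. The spec does not state a byte order, so big-endian (network order) is an assumption here, and the `VoicePacket` type and its method names are illustrative.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// HeaderSize is SeqNum (2) + Timestamp (4) + SSRC (4) + Payload Length (2).
const HeaderSize = 12

var (
	ErrShortPacket     = errors.New("voice packet: buffer shorter than declared length")
	ErrPayloadTooLarge = errors.New("voice packet: payload exceeds 65535 bytes")
)

// VoicePacket mirrors the fields in the packet format diagram.
type VoicePacket struct {
	SeqNum    uint16
	Timestamp uint32
	SSRC      uint32
	Payload   []byte // encoded Opus data
}

// Marshal encodes the packet; big-endian field order is assumed.
func (p *VoicePacket) Marshal() ([]byte, error) {
	if len(p.Payload) > 0xFFFF {
		return nil, ErrPayloadTooLarge
	}
	buf := make([]byte, HeaderSize+len(p.Payload))
	binary.BigEndian.PutUint16(buf[0:2], p.SeqNum)
	binary.BigEndian.PutUint32(buf[2:6], p.Timestamp)
	binary.BigEndian.PutUint32(buf[6:10], p.SSRC)
	binary.BigEndian.PutUint16(buf[10:12], uint16(len(p.Payload)))
	copy(buf[HeaderSize:], p.Payload)
	return buf, nil
}

// Unmarshal parses a packet and copies the payload out of buf.
func Unmarshal(buf []byte) (*VoicePacket, error) {
	if len(buf) < HeaderSize {
		return nil, ErrShortPacket
	}
	n := int(binary.BigEndian.Uint16(buf[10:12]))
	if len(buf) < HeaderSize+n {
		return nil, ErrShortPacket
	}
	return &VoicePacket{
		SeqNum:    binary.BigEndian.Uint16(buf[0:2]),
		Timestamp: binary.BigEndian.Uint32(buf[2:6]),
		SSRC:      binary.BigEndian.Uint32(buf[6:10]),
		Payload:   append([]byte(nil), buf[HeaderSize:HeaderSize+n]...),
	}, nil
}

func main() {
	in := VoicePacket{SeqNum: 7, Timestamp: 960, SSRC: 0xCAFE, Payload: []byte{1, 2, 3}}
	wire, _ := in.Marshal()
	out, err := Unmarshal(wire)
	fmt.Println(len(wire), out.SeqNum, out.Timestamp, err)
}
```

Rejecting buffers shorter than the declared payload length lets the receiver discard truncated packets before they reach the decoder.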

### Voice Packet Routing (Server)
- Receive voice packets from connected clients
- Identify source client and current channel
- Broadcast to all other connected clients in the same channel
- Drop packets from unauthenticated clients
- Handle packet loss gracefully (no retransmission needed for voice)
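
The routing rules above can be sketched as an in-memory router. This is a simplified illustration, not the server design: the `Client` and `Router` types are hypothetical, each client's outgoing stream is modeled as a buffered channel, and skipping unauthenticated receivers as well as senders is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a hypothetical view of one connected client.
type Client struct {
	ID            string
	Channel       string
	Authenticated bool
	Outbox        chan []byte // stands in for the client's outgoing stream
}

// Router tracks connected clients and forwards voice packets.
type Router struct {
	mu      sync.RWMutex
	clients map[string]*Client
}

func NewRouter() *Router {
	return &Router{clients: make(map[string]*Client)}
}

func (r *Router) Add(c *Client) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.clients[c.ID] = c
}

// Route forwards pkt to every other authenticated client in the sender's
// channel and returns the number of deliveries. Packets from unknown or
// unauthenticated senders are dropped.
func (r *Router) Route(senderID string, pkt []byte) int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	sender, ok := r.clients[senderID]
	if !ok || !sender.Authenticated {
		return 0
	}
	delivered := 0
	for _, c := range r.clients {
		if c.ID == senderID || c.Channel != sender.Channel || !c.Authenticated {
			continue
		}
		select {
		case c.Outbox <- pkt:
			delivered++
		default:
			// Receiver is backed up: drop the packet, voice is never retransmitted.
		}
	}
	return delivered
}

func main() {
	r := NewRouter()
	for _, id := range []string{"alice", "bob", "carol"} {
		r.Add(&Client{ID: id, Channel: "lobby", Authenticated: true, Outbox: make(chan []byte, 8)})
	}
	fmt.Println(r.Route("alice", []byte("opus-frame")))
}
```

The non-blocking send keeps one slow receiver from stalling delivery to the rest of the channel.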

### Audio Decoding & Playback (Client)
- Decode multiple incoming Opus streams
- Maintain separate decoders for each speaker
- Mix multiple streams for playback
- Maintain a jitter buffer (20-100ms)
- Handle packet loss (silence/interpolation)
- Support volume adjustment per speaker and master volume
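
Mixing with per-speaker and master volume can be sketched as a sum over decoded PCM frames followed by clamping to the 16-bit range. The `Mix` function and its signature are illustrative; keying speakers by SSRC and defaulting a missing gain to 1.0 are assumptions of this sketch, and hard clipping is only the simplest option (a limiter would sound better under heavy overlap).

```go
package main

import (
	"fmt"
	"math"
)

// Mix sums one decoded PCM frame per speaker into a single output frame of
// n samples. gains holds per-speaker volume (missing entries default to 1.0)
// and master is applied to the sum. Frames shorter than n contribute silence
// for the missing samples. The result is clamped to the int16 range.
func Mix(frames map[uint32][]int16, gains map[uint32]float64, master float64, n int) []int16 {
	acc := make([]float64, n)
	for ssrc, pcm := range frames {
		gain, ok := gains[ssrc]
		if !ok {
			gain = 1.0
		}
		for i := 0; i < n && i < len(pcm); i++ {
			acc[i] += float64(pcm[i]) * gain
		}
	}
	out := make([]int16, n)
	for i, v := range acc {
		v = math.Round(v * master)
		v = math.Max(math.MinInt16, math.Min(math.MaxInt16, v))
		out[i] = int16(v)
	}
	return out
}

func main() {
	frames := map[uint32][]int16{
		1: {1000, -2000, 30000},
		2: {500, -500, 30000},
	}
	gains := map[uint32]float64{2: 0.5} // speaker 2 at half volume
	fmt.Println(Mix(frames, gains, 1.0, 3))
}
```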

## Performance Requirements
- **Latency:** <100ms round-trip (E2E)
- **Jitter:** <50ms acceptable variation
- **Packet Loss Tolerance:** Up to 2% loss without noticeable degradation
- **Memory:** <50MB for audio subsystem (including buffers and decoders)
- **CPU:** Single audio stream <5% on modern dual-core CPU
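
The spec does not say how jitter is measured. One common choice is the running interarrival jitter estimate described in RFC 3550, sketched below as a possible way to check the 50ms target. The `JitterEstimator` type is hypothetical, and timestamps are taken in milliseconds for simplicity. Because only differences in transit time are used, sender and receiver clocks do not need to be synchronized.

```go
package main

import (
	"fmt"
	"math"
)

// JitterEstimator keeps a smoothed interarrival jitter estimate in ms,
// following the RFC 3550 update rule J += (|D| - J) / 16.
type JitterEstimator struct {
	jitter      float64
	prevTransit float64
	started     bool
}

// Observe records one packet. sentMs is the sender timestamp and recvMs the
// local arrival time, both in milliseconds. It returns the updated estimate.
func (e *JitterEstimator) Observe(sentMs, recvMs float64) float64 {
	transit := recvMs - sentMs
	if e.started {
		d := math.Abs(transit - e.prevTransit)
		e.jitter += (d - e.jitter) / 16
	}
	e.prevTransit = transit
	e.started = true
	return e.jitter
}

// WithinBudget reports whether the current estimate is below maxMs.
func (e *JitterEstimator) WithinBudget(maxMs float64) bool {
	return e.jitter < maxMs
}

func main() {
	var e JitterEstimator
	arrivals := [][2]float64{{0, 5}, {20, 25}, {40, 45}, {60, 81}}
	for _, a := range arrivals {
		fmt.Printf("jitter=%.3fms\n", e.Observe(a[0], a[1]))
	}
	fmt.Println("within 50ms budget:", e.WithinBudget(50))
}
```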

## Data Flow

### Publishing Voice Stream
```
User → Microphone → Audio Capture (Device)
    ↓
Audio Processing (gain, echo cancellation)
    ↓
Opus Encoder (20ms frames)
    ↓
RTP-like Packets with metadata
    ↓
gRPC Streaming to Server
```

### Receiving Voice Stream
```
Server broadcasts packet to all channel members
    ↓
Client receives on audio stream listener
    ↓
Opus Decoder (separate per speaker)
    ↓
Audio Mix Engine (combine multiple speakers)
    ↓
Audio Playback Device
    ↓
Speaker Output
```

## Error Handling
- Lost packets: silence substitution or previous frame interpolation
- Decoder errors: skip corrupted packets, log error
- Device unavailable: graceful fallback, user notification
- Network interruption: auto-reconnect voice stream
- Buffer overflow: drop oldest frames, log warning
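
The lost-packet and buffer-overflow rules can be illustrated with a small sequence-ordered jitter buffer. This is a sketch under stated assumptions: the `JitterBuffer` type is hypothetical, capacity is counted in 20ms frames (so 5 frames is roughly the 100ms upper bound), and sequence comparison uses signed 16-bit distance so that wraparound from 65535 to 0 is handled. When `Pop` reports a missing frame, the caller substitutes silence or conceals the gap.

```go
package main

import "fmt"

// JitterBuffer reorders frames by sequence number and bounds memory use.
type JitterBuffer struct {
	capacity int
	frames   map[uint16][]byte
	next     uint16 // sequence number of the next playout slot
	started  bool
	Dropped  int // late arrivals plus frames evicted on overflow
}

func NewJitterBuffer(capacity int) *JitterBuffer {
	return &JitterBuffer{capacity: capacity, frames: make(map[uint16][]byte)}
}

// Push stores a frame. Frames older than the playout position are discarded,
// and on overflow the oldest buffered frames are dropped.
func (j *JitterBuffer) Push(seq uint16, data []byte) {
	if !j.started {
		j.next, j.started = seq, true
	}
	if int16(seq-j.next) < 0 { // wraparound-safe "seq is before next"
		j.Dropped++
		return
	}
	j.frames[seq] = data
	for len(j.frames) > j.capacity {
		if _, ok := j.frames[j.next]; ok {
			delete(j.frames, j.next)
			j.Dropped++
		}
		j.next++
	}
}

// Pop returns the frame for the next playout slot. ok is false when that
// frame is missing, in which case the caller should play silence or conceal.
func (j *JitterBuffer) Pop() (data []byte, ok bool) {
	if !j.started {
		return nil, false
	}
	data, ok = j.frames[j.next]
	delete(j.frames, j.next)
	j.next++
	return data, ok
}

func main() {
	jb := NewJitterBuffer(5) // 5 frames of 20 ms
	jb.Push(10, []byte("frame-10"))
	jb.Push(12, []byte("frame-12")) // frame 11 never arrives
	for i := 0; i < 3; i++ {
		if data, ok := jb.Pop(); ok {
			fmt.Println(string(data))
		} else {
			fmt.Println("missing frame: substitute silence")
		}
	}
}
```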

## Configuration
- Audio device selection (OS-dependent enumeration)
- Microphone volume level
- Speaker volume level
- Bitrate preference
- Enable/disable voice activity detection
- Enable/disable echo cancellation

## Dependencies
- Opus codec library (libopus bindings; gopxl/beep handles playback and mixing but does not provide an Opus encoder)
- Audio device access (PortAudio or OS-specific APIs)
- gRPC streaming for packet transport (RTP-like packet framing)

## Testing Strategy
- Unit tests for Opus encoding/decoding
- Network simulation tests for packet loss
- Integration tests with mock audio devices
- Latency measurement benchmarks
- Jitter buffer tests with varying packet arrival times
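
For the network simulation tests, a seeded random-loss link keeps runs reproducible. The sketch below is one way to model it; `LossyLink` and `CountLost` are hypothetical helpers, and independent per-packet loss is a simplifying assumption (real networks often lose packets in bursts).

```go
package main

import (
	"fmt"
	"math/rand"
)

// LossyLink drops each packet independently with probability lossRate.
type LossyLink struct {
	rng      *rand.Rand
	lossRate float64
}

// NewLossyLink builds a link with a fixed seed so test runs are repeatable.
func NewLossyLink(seed int64, lossRate float64) *LossyLink {
	return &LossyLink{rng: rand.New(rand.NewSource(seed)), lossRate: lossRate}
}

// Deliver reports whether the next packet survives the link.
func (l *LossyLink) Deliver() bool {
	return l.rng.Float64() >= l.lossRate
}

// CountLost sends n packets through the link and returns how many were dropped.
func CountLost(l *LossyLink, n int) int {
	lost := 0
	for i := 0; i < n; i++ {
		if !l.Deliver() {
			lost++
		}
	}
	return lost
}

func main() {
	link := NewLossyLink(1, 0.02) // 2% loss, the tolerance target in this spec
	fmt.Printf("lost %d of 10000 packets\n", CountLost(link, 10000))
}
```

Feeding the surviving packets into the decode path lets a test assert that playback continues at the 2% loss target.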