TTS Service: Text-to-Speech
TTS in Nexivo is implemented inside Atlas (`~/PycharmProjects/AI/atlas/src/voice_pipeline`). There is no separate TTS microservice: Atlas selects, configures, and calls the TTS provider directly, then streams the audio back to the caller via LiveKit.
The TTS provider and voice are configured per agent via `voice_config.tts_provider` and `voice_config.voice` in Compass.
Supported Providers
| Provider | Model(s) | Notes |
|---|---|---|
| `openai` | `gpt-4o-mini-tts` | Default; high quality |
| `gemini` | `gemini-2.5-flash-preview-tts` | Fast, expressive |
| `elevenlabs` | `eleven_turbo_v2_5` | Natural, multilingual |
| `google` | `Chirp3-HD` | Streaming; Indian languages |
| `cartesia` | `sonic-2`, `sonic-3` | Emotion and speed control |
| `hume` | `octave v1`, `octave v2` | Expressive, emotional |
| `huggingface` | Chatterbox | On-premise |
| `on_premise` | Custom API | localhost:5000 |
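As a rough illustration of how a per-agent provider setting could be checked against the table above, here is a minimal Python sketch. The provider keys and model names are copied from the table; the `VoiceConfig` class, the `resolve_models` helper, and the fallback to `openai` when no provider is set are assumptions made for this example, not Atlas's actual code.

```python
from dataclasses import dataclass

# Provider keys and model names copied from the table above.
SUPPORTED_MODELS = {
    "openai": ["gpt-4o-mini-tts"],
    "gemini": ["gemini-2.5-flash-preview-tts"],
    "elevenlabs": ["eleven_turbo_v2_5"],
    "google": ["Chirp3-HD"],
    "cartesia": ["sonic-2", "sonic-3"],
    "hume": ["octave v1", "octave v2"],
    "huggingface": ["Chatterbox"],
    "on_premise": ["Custom API"],
}

# The table marks openai as the default provider.
DEFAULT_PROVIDER = "openai"


@dataclass
class VoiceConfig:
    """Hypothetical mirror of the voice_config fields stored in Compass."""
    tts_provider: str = DEFAULT_PROVIDER
    voice: str = ""


def resolve_models(config: VoiceConfig) -> list[str]:
    """Return the models listed for the configured provider.

    Assumption: an empty provider falls back to the default, and an
    unknown provider is rejected with a ValueError.
    """
    provider = config.tts_provider or DEFAULT_PROVIDER
    if provider not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported TTS provider: {provider!r}")
    return SUPPORTED_MODELS[provider]
```

Rejecting unknown providers up front keeps a typo in the agent record from surfacing later as a provider call failure in the middle of a live call.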
Text Pre-Processing
Before synthesis, Atlas applies a text processing pipeline to the LLM response:
| Step | Description |
|---|---|
| Abbreviation expansion | Expands common abbreviations (e.g. Dr. → Doctor); results are cached in Redis |
| Number normalisation | Converts digits to spoken form (e.g. 42 → forty-two) |
| Markdown stripping | Removes `**bold**`, `_italic_`, and bullet points |
| Character cleaning | Removes unspeakable characters |
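The four steps above can be sketched as a chain of small functions applied in table order. This is an illustrative Python sketch only: every function name is hypothetical, the abbreviation map contains just the one example given in the table, numbers are spelled out only up to 999, the set of "unspeakable" characters (emoji-like symbols, control characters, leftover markup) is an assumption, and the Redis caching of abbreviation results is omitted.

```python
import re
import unicodedata

# Only "Dr." comes from the table above; the real abbreviation list is unknown.
ABBREVIATIONS = {"Dr.": "Doctor"}

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]

# Assumed set of leftover markup characters that should not be spoken.
MARKUP_LEFTOVERS = set("#*`~|<>")


def expand_abbreviations(text: str) -> str:
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("-" + _ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = _ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalise_numbers(text: str) -> str:
    def spell(match: re.Match) -> str:
        value = int(match.group())
        # Larger numbers are left as digits in this sketch.
        return number_to_words(value) if value <= 999 else match.group()
    return re.sub(r"\b\d+\b", spell, text)


def strip_markdown(text: str) -> str:
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)          # **bold**
    text = re.sub(r"(?<!\w)_(.+?)_(?!\w)", r"\1", text)   # _italic_
    return re.sub(r"^[ \t]*[-*•][ \t]+", "", text, flags=re.MULTILINE)


def clean_characters(text: str) -> str:
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        is_control = category == "Cc" and ch not in "\n\t"
        if ch in MARKUP_LEFTOVERS or category == "So" or is_control:
            continue
        kept.append(ch)
    # Collapse the gaps left behind by removed characters.
    return re.sub(r"[ \t]+", " ", "".join(kept)).strip()


def preprocess_for_tts(text: str) -> str:
    steps = (expand_abbreviations, normalise_numbers,
             strip_markdown, clean_characters)
    for step in steps:
        text = step(text)
    return text


print(preprocess_for_tts("**Dr.** Smith has 42 items"))
# prints: Doctor Smith has forty-two items
```

Decimals such as 3.5 are treated as two separate integers by this sketch, so a production pipeline would need a more careful number pattern.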
Output
Atlas receives a PCM or Opus audio stream from the TTS provider and forwards it directly to LiveKit for real-time playback to the caller.
TTS output text is logged at INFO level (first 300 characters) for debugging.
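The truncated logging described above might look like the following Python sketch. The logger name `atlas.tts` and the `log_tts_text` helper are hypothetical; only the INFO level and the 300-character limit come from the text.

```python
import logging

logger = logging.getLogger("atlas.tts")  # hypothetical logger name

MAX_LOGGED_CHARS = 300  # limit stated in the text above


def log_tts_text(text: str) -> str:
    """Log the first 300 characters of the text sent to TTS and return them."""
    preview = text[:MAX_LOGGED_CHARS]
    logger.info("TTS output: %s", preview)
    return preview
```

Truncating before logging keeps long LLM responses from flooding the log while still leaving enough text to identify the utterance.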
Voice Model Selection
The `voice_config.voice` field in the Compass agent record selects the specific voice within the provider's catalogue. Example values:
| Voice ID | Provider | Language | Character |
|---|---|---|---|
| `Zephyr` | | en-IN | Female |
| `alloy` | OpenAI | en | Neutral |
| `Rachel` | ElevenLabs | en | Female |
| `en-US-Neural2-F` | Google Cloud | en-US | Female |
| `ar-XA-Standard-B` | Google Cloud | ar | Male |
TO DO
Document the full voice catalogue and per-provider voice ID conventions.
Interruption Handling
Atlas monitors the caller audio stream during TTS playback:
- If the caller starts speaking during TTS, Atlas evaluates whether to interrupt based on:
  - Minimum interruption duration threshold
  - Minimum word count threshold
- Interruptions that don't meet the threshold are suppressed (logged as `interruption rejected`)
- Valid interruptions stop TTS immediately and process the new utterance
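The decision described above can be sketched in Python as follows. The threshold values, the `InterruptionThresholds` class, the `should_interrupt` function, and the logger name are all hypothetical. The sketch also assumes that both thresholds must be met for an interruption to count; the text does not say whether meeting either one alone would be enough.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("atlas.interruptions")  # hypothetical logger name


@dataclass(frozen=True)
class InterruptionThresholds:
    min_duration_s: float = 0.5  # hypothetical default
    min_words: int = 2           # hypothetical default


def should_interrupt(speech_duration_s: float,
                     transcript: str,
                     thresholds: InterruptionThresholds) -> bool:
    """Return True if caller speech during TTS counts as a valid interruption.

    Assumption: the duration AND word count thresholds must both be met.
    """
    word_count = len(transcript.split())
    valid = (speech_duration_s >= thresholds.min_duration_s
             and word_count >= thresholds.min_words)
    if not valid:
        logger.info("interruption rejected")
    return valid
```

With these example defaults, `should_interrupt(0.8, "wait a moment", InterruptionThresholds())` returns `True`, while a short filler such as `"uh"` is rejected on the word count and TTS playback would continue.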