TTS Service — Text-to-Speech

TTS in Nexivo is implemented inside Atlas (~/PycharmProjects/AI/atlas/src/voice_pipeline). There is no separate TTS microservice — Atlas selects, configures, and calls the TTS provider directly, then streams the audio back to the caller via LiveKit.

The TTS provider and voice are configured per agent via voice_config.tts_provider and voice_config.voice in Compass.


Supported Providers

| Provider    | Model(s)                     | Notes                       |
|-------------|------------------------------|-----------------------------|
| openai      | gpt-4o-mini-tts              | Default; high quality       |
| gemini      | gemini-2.5-flash-preview-tts | Fast, expressive            |
| elevenlabs  | eleven_turbo_v2_5            | Natural, multilingual       |
| google      | Chirp3-HD                    | Streaming; Indian languages |
| cartesia    | sonic-2, sonic-3             | Emotion and speed control   |
| hume        | octave v1, octave v2         | Expressive, emotional       |
| huggingface | Chatterbox                   | On-premise                  |
| on_premise  | Custom API                   | localhost:5000              |
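
As an illustration of how a per-agent provider setting might be resolved against the table above, here is a minimal sketch. The `VoiceConfig` class, the `resolve_provider` function, and the behaviour of raising on an unknown provider are assumptions made for this example, not Atlas's actual code; only the provider keys, the model names, and the `tts_provider` and `voice` field names come from this page.

```python
from dataclasses import dataclass

# Provider keys and model names copied from the table above.
# The dictionary layout itself is illustrative.
SUPPORTED_MODELS = {
    "openai": ["gpt-4o-mini-tts"],
    "gemini": ["gemini-2.5-flash-preview-tts"],
    "elevenlabs": ["eleven_turbo_v2_5"],
    "google": ["Chirp3-HD"],
    "cartesia": ["sonic-2", "sonic-3"],
    "hume": ["octave v1", "octave v2"],
    "huggingface": ["Chatterbox"],
    "on_premise": ["Custom API"],
}

DEFAULT_PROVIDER = "openai"  # listed as the default in the table


@dataclass
class VoiceConfig:
    """Hypothetical stand-in for the voice_config block of an agent record."""
    tts_provider: str = DEFAULT_PROVIDER
    voice: str = "alloy"


def resolve_provider(config: VoiceConfig) -> tuple[str, list[str]]:
    """Return the normalised provider key and the models listed for it."""
    key = config.tts_provider.strip().lower()
    if key not in SUPPORTED_MODELS:
        # Assumption: an unknown provider is treated as a configuration error.
        raise ValueError(f"Unsupported TTS provider: {config.tts_provider!r}")
    return key, SUPPORTED_MODELS[key]


provider, models = resolve_provider(VoiceConfig(tts_provider="cartesia", voice="example-voice"))
print(provider, models)
```

Failing loudly on an unknown provider key is one possible design choice; a real deployment might instead fall back to the default provider.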

Text Pre-Processing

Before synthesis, Atlas applies a text processing pipeline to the LLM response:

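| Step                   | Description                                                                      |
|------------------------|----------------------------------------------------------------------------------|
| Abbreviation expansion | Expands common abbreviations (e.g. Dr. → Doctor); results are cached in Redis    |
| Number normalisation   | Converts digits to spoken form (e.g. 42 → forty-two)                             |
| Markdown stripping     | Removes **bold**, _italic_, bullet points                                        |
| Character cleaning     | Removes unspeakable characters                                                   |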
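
The four steps can be sketched as a chain of small functions. This is a simplified illustration under stated assumptions, not the Atlas implementation: the function names, the abbreviation list, the 0 to 999 number range, and the set of characters treated as speakable are all hypothetical, and the Redis cache is omitted.

```python
import re

# Illustrative abbreviation list; the real list is not documented here.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 (range chosen for brevity)."""
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("-" + _ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return _ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def expand_abbreviations(text: str) -> str:
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return text


def normalise_numbers(text: str) -> str:
    # Only standalone integers of up to three digits; longer numbers are left as-is.
    return re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)


def strip_markdown(text: str) -> str:
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)                  # **bold**
    text = re.sub(r"(?<!\w)_(.+?)_(?!\w)", r"\1", text)           # _italic_
    text = re.sub(r"^\s*[-*•]\s+", "", text, flags=re.MULTILINE)  # bullet markers
    return text


def clean_characters(text: str) -> str:
    # Assumption: letters, digits, whitespace and basic punctuation are speakable.
    text = re.sub(r"[^\w\s.,;:!?'()%-]", "", text)
    return re.sub(r"[ \t]+", " ", text).strip()


def preprocess_for_tts(text: str) -> str:
    """Apply the four steps in the order listed in the table."""
    for step in (expand_abbreviations, normalise_numbers, strip_markdown, clean_characters):
        text = step(text)
    return text


print(preprocess_for_tts("**Dr.** Smith has 42 _new_ patients 🙂"))
# Doctor Smith has forty-two new patients
```

Because `\w` is Unicode-aware in Python 3, the character-cleaning step in this sketch keeps non-Latin letters, which matters for the multilingual providers listed above.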

Output

Atlas receives a PCM or Opus audio stream from the TTS provider and forwards it directly to LiveKit for real-time playback to the caller.

TTS output text is logged at INFO level (first 300 characters) for debugging.
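
A minimal sketch of the truncated logging described above, using Python's standard `logging` module. The logger name `atlas.tts`, the function name, and the message format are hypothetical; only the INFO level and the 300-character limit come from this page.

```python
import logging

logger = logging.getLogger("atlas.tts")  # hypothetical logger name

LOG_PREVIEW_CHARS = 300  # limit stated above


def log_tts_text(text: str) -> str:
    """Log at most the first 300 characters of the text sent to TTS."""
    preview = text[:LOG_PREVIEW_CHARS]
    logger.info("TTS output text: %s", preview)
    return preview
```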


Voice Model Selection

The voice_config.voice field in the Compass agent record selects the specific voice within the provider's catalogue. Example values:

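| Voice ID         | Provider     | Language | Character |
|------------------|--------------|----------|-----------|
| Zephyr           | Google       | en-IN    | Female    |
| alloy            | OpenAI       | en       | Neutral   |
| Rachel           | ElevenLabs   | en       | Female    |
| en-US-Neural2-F  | Google Cloud | en-US    | Female    |
| ar-XA-Standard-B | Google Cloud | ar       | Male      |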

TO DO

Document the full voice catalogue and per-provider voice ID conventions.


Interruption Handling

Atlas monitors the caller audio stream during TTS playback:

  • If the caller starts speaking during TTS, Atlas evaluates whether to interrupt based on:
      • Minimum interruption duration threshold
      • Minimum word count threshold
  • Interruptions that do not meet the thresholds are suppressed (logged as interruption rejected)
  • Valid interruptions stop TTS playback immediately, and Atlas processes the new utterance
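
The decision logic above can be sketched as follows. This is an illustration only: the class and function names, the default threshold values, and the rule that both thresholds must be met are assumptions, since this page does not state how the two thresholds are combined.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("atlas.interruptions")  # hypothetical logger name


@dataclass(frozen=True)
class InterruptionPolicy:
    """Hypothetical thresholds; the real values are not documented here."""
    min_duration_s: float = 0.5
    min_words: int = 2


def should_interrupt(speech_duration_s: float, transcript: str,
                     policy: InterruptionPolicy) -> bool:
    """Assumption: both the duration and the word-count threshold must be met."""
    word_count = len(transcript.split())
    return (speech_duration_s >= policy.min_duration_s
            and word_count >= policy.min_words)


def handle_caller_speech(speech_duration_s: float, transcript: str,
                         policy: InterruptionPolicy) -> str:
    """Return 'interrupted' if TTS should stop, otherwise 'rejected'."""
    if should_interrupt(speech_duration_s, transcript, policy):
        # A real pipeline would stop TTS playback here and hand the
        # utterance to the next processing stage.
        return "interrupted"
    logger.info("interruption rejected")
    return "rejected"


policy = InterruptionPolicy()
print(handle_caller_speech(0.8, "wait a moment", policy))  # interrupted
print(handle_caller_speech(0.2, "uh", policy))             # rejected
```

Requiring both thresholds filters out short backchannel sounds such as "uh" or "mm-hm" while still letting a clear multi-word interjection cut playback.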