STT Service — Speech-to-Text¶
STT in Nexivo is implemented inside Atlas (~/PycharmProjects/AI/atlas/src/voice_pipeline). There is no separate STT microservice — Atlas selects, configures, and calls the STT provider directly as part of the voice pipeline.
The STT provider is configured per agent via the voice_config.stt_provider field in Compass.
Internal Pipeline¶
graph LR
Audio[LiveKit\nAudio] --> VAD[VAD\nSilero]
VAD -->|Speech frames| SA[Stream Adapter]
SA -->|Merged audio| HF[Hallucination\nFilter]
HF -->|Clean audio| STT[STT Provider]
STT -->|Transcript| Atlas[Atlas\nOrchestrator]
Stages¶
| Stage | Description |
|---|---|
| VAD (Silero) | Detects voice activity; suppresses silence and background noise |
| Stream Adapter | Batches frames at end-of-speech before sending to STT |
| Hallucination Filter | Suppresses Whisper echo/silence artefacts within 3 s post-TTS window |
| STT Provider | Transcribes audio to text |
End-of-utterance (turn detection) is handled by the VAD detecting END_OF_SPEECH.
Supported Providers¶
Configured via voice_config.stt_provider in the agent record:
| Provider | Model(s) | Notes |
|---|---|---|
openai |
gpt-4o-transcribe |
Recommended; streaming |
google |
chirp_3, latest_long |
Multilingual |
groq |
whisper-large-v3-turbo, whisper-large-v3 |
Fast Whisper |
deepgram |
nova-3 |
Streaming; default fallback |
elevenlabs |
scribe_v2_realtime |
99+ languages |
cartesia |
ink-whisper |
Code-switching support |
gladia |
Solaria-1 |
Per-utterance language detection + confidence |
huggingface |
Whisper via Inference API | On-premise / Arabic |
baseten |
whisper-large-v3 |
Streaming |
Multilingual & Language Detection¶
For agents configured with multiple languages, Atlas can use Gladia Solaria as a secondary language-detection STT:
- Each utterance is run through Gladia first for language confidence scoring
- Once a language is identified above threshold, the pipeline locks to the primary STT provider for that language
- If the caller explicitly says a language name, detection bypasses the confirmation step
Language codes are set via the language field in the agent config (Compass).
Output¶
Atlas receives a transcript per completed turn:
{
"transcript": "I'd like to check the status of my order",
"language": "en",
"confidence": 0.97,
"is_final": true
}
This is passed to the LLM Service as a ChatMessage.
Observability¶
Atlas logs detailed per-utterance STT diagnostics:
stt.vad.decisionOTEL span — outcome:forwarded/dropped_hallucination/dropped_echo_tail/dropped_short- RMS and peak levels per utterance
- Ghost segment detection (>15 s accumulation → TTS echo)
- Hallucination filter decisions with reasoning