
STT Service — Speech-to-Text

STT in Nexivo is implemented inside Atlas (~/PycharmProjects/AI/atlas/src/voice_pipeline). There is no separate STT microservice — Atlas selects, configures, and calls the STT provider directly as part of the voice pipeline.

The STT provider is configured per agent via the voice_config.stt_provider field in Compass.


Internal Pipeline

graph LR
    Audio[LiveKit\nAudio] --> VAD[VAD\nSilero]
    VAD -->|Speech frames| SA[Stream Adapter]
    SA -->|Merged audio| HF[Hallucination\nFilter]
    HF -->|Clean audio| STT[STT Provider]
    STT -->|Transcript| Atlas[Atlas\nOrchestrator]

Stages

| Stage | Description |
| --- | --- |
| VAD (Silero) | Detects voice activity; suppresses silence and background noise |
| Stream Adapter | Batches frames at end-of-speech before sending to STT |
| Hallucination Filter | Suppresses Whisper echo/silence artefacts within 3 s post-TTS window |
| STT Provider | Transcribes audio to text |

End-of-utterance detection (turn detection) is handled by the VAD: a turn ends when the VAD signals END_OF_SPEECH.
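The batching behaviour of the Stream Adapter can be illustrated with a minimal sketch. It is not the Atlas implementation: the `VADEvent`, `VADEventType` and `StreamAdapter` names, and the `START_OF_SPEECH` and `SPEECH_FRAME` event types, are hypothetical and chosen only for illustration. Only END_OF_SPEECH comes from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum, auto


class VADEventType(Enum):
    # Hypothetical event names; only END_OF_SPEECH is named in this document.
    START_OF_SPEECH = auto()
    SPEECH_FRAME = auto()
    END_OF_SPEECH = auto()


@dataclass
class VADEvent:
    type: VADEventType
    frame: bytes = b""  # raw audio, only meaningful for SPEECH_FRAME


@dataclass
class StreamAdapter:
    """Buffers speech frames and releases one merged utterance per turn."""

    _buffer: list[bytes] = field(default_factory=list)

    def push(self, event: VADEvent) -> bytes | None:
        """Return the merged audio at end-of-speech, otherwise None."""
        if event.type is VADEventType.START_OF_SPEECH:
            self._buffer.clear()
        elif event.type is VADEventType.SPEECH_FRAME:
            self._buffer.append(event.frame)
        elif event.type is VADEventType.END_OF_SPEECH:
            merged = b"".join(self._buffer)
            self._buffer.clear()
            return merged or None  # nothing buffered means nothing to send
        return None
```

In this sketch the caller forwards the returned bytes to the next stage only when `push` returns a non-empty value, so the STT provider sees one request per completed turn.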


Supported Providers

Configured via voice_config.stt_provider in the agent record:

| Provider | Model(s) | Notes |
| --- | --- | --- |
| openai | gpt-4o-transcribe | Recommended; streaming |
| google | chirp_3, latest_long | Multilingual |
| groq | whisper-large-v3-turbo, whisper-large-v3 | Fast Whisper |
| deepgram | nova-3 | Streaming; default fallback |
| elevenlabs | scribe_v2_realtime | 99+ languages |
| cartesia | ink-whisper | Code-switching support |
| gladia | Solaria-1 | Per-utterance language detection + confidence |
| huggingface | Whisper via Inference API | On-premise / Arabic |
| baseten | whisper-large-v3 | Streaming |
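As a rough sketch of how a provider might be resolved from the agent record, the snippet below maps the providers listed above to their models. Several things here are assumptions rather than documented behaviour: the `resolve_stt` function, the `stt_model` key, the choice of the first listed model as the default, and the reading of "default fallback" as "used when `stt_provider` is missing or unrecognised".

```python
from __future__ import annotations

# Providers and models as listed in this document. The huggingface entry has
# no model identifier in the source, so it is left empty here.
SUPPORTED_MODELS: dict[str, list[str]] = {
    "openai": ["gpt-4o-transcribe"],
    "google": ["chirp_3", "latest_long"],
    "groq": ["whisper-large-v3-turbo", "whisper-large-v3"],
    "deepgram": ["nova-3"],
    "elevenlabs": ["scribe_v2_realtime"],
    "cartesia": ["ink-whisper"],
    "gladia": ["Solaria-1"],
    "huggingface": [],
    "baseten": ["whisper-large-v3"],
}
FALLBACK_PROVIDER = "deepgram"  # assumed meaning of "default fallback"


def resolve_stt(voice_config: dict) -> tuple[str, str | None]:
    """Pick a (provider, model) pair from a voice_config mapping."""
    provider = voice_config.get("stt_provider")
    if provider not in SUPPORTED_MODELS:
        provider = FALLBACK_PROVIDER
    models = SUPPORTED_MODELS[provider]
    model = voice_config.get("stt_model")  # hypothetical field
    if model not in models:
        # Assumption: default to the first model listed for the provider.
        model = models[0] if models else None
    return provider, model


print(resolve_stt({"stt_provider": "groq"}))
print(resolve_stt({}))
```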

Multilingual & Language Detection

For agents configured with multiple languages, Atlas can use Gladia (Solaria-1) as a secondary, language-detection STT:

  • Each utterance is first run through Gladia to obtain a language and a confidence score
  • Once a language is identified with confidence above the threshold, the pipeline locks to the primary STT provider for that language
  • If the caller explicitly says a language name, detection bypasses the confirmation step

Language codes are set via the language field in the agent config (Compass).
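The locking behaviour described in this section can be sketched as a small state holder. This is an illustration under stated assumptions, not the Atlas logic: the `LanguageLock` class, the 0.8 threshold, and the `LANGUAGE_NAMES` lookup are all hypothetical, and "bypasses the confirmation step" is interpreted here as "locks without checking the confidence threshold".

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical spoken-name lookup; a real deployment would cover more names.
LANGUAGE_NAMES = {"english": "en", "arabic": "ar", "french": "fr"}


@dataclass
class LanguageLock:
    allowed: tuple[str, ...]      # language codes configured for the agent
    threshold: float = 0.8        # assumed value, not from the source
    locked: str | None = None

    def observe(self, transcript: str, language: str, confidence: float) -> str | None:
        """Feed one detection result; return the locked language, if any."""
        if self.locked is not None:
            return self.locked  # already locked, detection is skipped
        # An explicit language name in the utterance locks immediately.
        for word in transcript.lower().split():
            code = LANGUAGE_NAMES.get(word.strip(".,!?"))
            if code in self.allowed:
                self.locked = code
                return self.locked
        # Otherwise lock only when the detector is confident enough.
        if language in self.allowed and confidence >= self.threshold:
            self.locked = language
        return self.locked


lock = LanguageLock(allowed=("en", "ar"))
lock.observe("hello there", "en", 0.50)   # below threshold, stays unlocked
lock.observe("hello again", "en", 0.93)   # locks to "en"
```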


Output

Atlas receives a transcript per completed turn:

{
  "transcript": "I'd like to check the status of my order",
  "language": "en",
  "confidence": 0.97,
  "is_final": true
}

This is passed to the LLM Service as a ChatMessage.
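A minimal sketch of that hand-off, assuming the payload arrives as JSON with the fields shown above. The `ChatMessage` dataclass here is a stand-in with hypothetical `role` and `content` fields, since the actual type is not defined in this document, and `to_chat_message` is an illustrative helper name.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatMessage:
    # Stand-in type; the real ChatMessage fields are not specified here.
    role: str
    content: str


def to_chat_message(payload: str) -> ChatMessage | None:
    """Convert a final STT result into a user message; skip partials."""
    result = json.loads(payload)
    if not result.get("is_final", False):
        return None  # only completed turns are forwarded
    text = result.get("transcript", "").strip()
    if not text:
        return None  # nothing to send for an empty transcript
    return ChatMessage(role="user", content=text)


payload = json.dumps({
    "transcript": "I'd like to check the status of my order",
    "language": "en",
    "confidence": 0.97,
    "is_final": True,
})
print(to_chat_message(payload))
```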


Observability

Atlas logs detailed per-utterance STT diagnostics:

  • stt.vad.decision OTEL span — outcome: forwarded / dropped_hallucination / dropped_echo_tail / dropped_short
  • RMS and peak levels per utterance
  • Ghost segment detection (more than 15 s of accumulated audio is treated as TTS echo)
  • Hallucination filter decisions with reasoning
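To make the level and ghost-segment diagnostics concrete, here is a small sketch that computes RMS and peak for one utterance and flags accumulations longer than 15 s. It assumes 16-bit signed PCM in native byte order at 16 kHz, mono; the audio format, the `utterance_levels` name, and the returned keys are assumptions, not the format Atlas logs.

```python
import array
import math

GHOST_SEGMENT_SECONDS = 15.0  # threshold taken from the list above


def utterance_levels(pcm: bytes, sample_rate: int = 16000) -> dict:
    """Return normalised RMS/peak, duration, and a ghost-segment flag.

    Assumes mono 16-bit signed PCM in native byte order.
    """
    samples = array.array("h")
    samples.frombytes(pcm)  # raises ValueError on an odd number of bytes
    if not samples:
        return {"rms": 0.0, "peak": 0.0, "duration_s": 0.0, "ghost_segment": False}
    # Normalise by full scale (32768) so both levels fall in [0, 1].
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
    peak = max(abs(s) for s in samples) / 32768.0
    duration_s = len(samples) / sample_rate
    return {
        "rms": rms,
        "peak": peak,
        "duration_s": duration_s,
        "ghost_segment": duration_s > GHOST_SEGMENT_SECONDS,
    }
```

Values like these could be attached as attributes on the stt.vad.decision span alongside the outcome, though how Atlas actually records them is not specified here.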