STT Service — Speech-to-Text¶

STT in Nexivo is implemented inside Atlas (~/PycharmProjects/AI/atlas/src/voice_pipeline). There is no separate STT microservice — Atlas selects, configures, and calls the STT provider directly as part of the voice pipeline.

The STT provider is configured per agent via the voice_config.stt_provider field in Compass.

Internal Pipeline¶

graph LR
    Audio[LiveKit\nAudio] --> VAD[VAD\nSilero]
    VAD -->|Speech frames| SA[Stream Adapter]
    SA -->|Merged audio| HF[Hallucination\nFilter]
    HF -->|Clean audio| STT[STT Provider]
    STT -->|Transcript| Atlas[Atlas\nOrchestrator]

Stages¶

Stage	Description
VAD (Silero)	Detects voice activity; suppresses silence and background noise
Stream Adapter	Batches frames at end-of-speech before sending to STT
Hallucination Filter	Suppresses Whisper echo/silence artefacts within 3 s post-TTS window
STT Provider	Transcribes audio to text

End-of-utterance (turn detection) is handled by the VAD detecting END_OF_SPEECH.

Supported Providers¶

Configured via voice_config.stt_provider in the agent record:

Provider	Model(s)	Notes
`openai`	`gpt-4o-transcribe`	Recommended; streaming
`google`	`chirp_3`, `latest_long`	Multilingual
`groq`	`whisper-large-v3-turbo`, `whisper-large-v3`	Fast Whisper
`deepgram`	`nova-3`	Streaming; default fallback
`elevenlabs`	`scribe_v2_realtime`	99+ languages
`cartesia`	`ink-whisper`	Code-switching support
`gladia`	`Solaria-1`	Per-utterance language detection + confidence
`huggingface`	Whisper via Inference API	On-premise / Arabic
`baseten`	`whisper-large-v3`	Streaming

Multilingual & Language Detection¶

For agents configured with multiple languages, Atlas can use Gladia Solaria as a secondary language-detection STT:

Each utterance is run through Gladia first for language confidence scoring
Once a language is identified above threshold, the pipeline locks to the primary STT provider for that language
If the caller explicitly says a language name, detection bypasses the confirmation step

Language codes are set via the language field in the agent config (Compass).

Output¶

Atlas receives a transcript per completed turn:

{
  "transcript": "I'd like to check the status of my order",
  "language": "en",
  "confidence": 0.97,
  "is_final": true
}

This is passed to the LLM Service as a ChatMessage.

Observability¶

Atlas logs detailed per-utterance STT diagnostics:

stt.vad.decision OTEL span — outcome: forwarded / dropped_hallucination / dropped_echo_tail / dropped_short
RMS and peak levels per utterance
Ghost segment detection (>15 s accumulation → TTS echo)
Hallucination filter decisions with reasoning