LLM Service¶
LLM inference in Nexivo is implemented inside Atlas (~/PycharmProjects/AI/atlas/src/voice_pipeline). There is no separate LLM microservice — Atlas manages the full conversation context and calls the configured LLM provider directly.
The LLM provider is configured per agent via voice_config.llm_model and voice_config.pipeline_mode in Compass.
Pipeline Modes¶
Atlas supports two pipeline modes selected per agent:
| Mode | Description |
|---|---|
stt_llm_tts |
Standard pipeline — STT transcript → LLM text → TTS audio |
realtime |
Audio-native model (e.g. Gemini Live, OpenAI Realtime) — audio in, audio out; no separate STT/TTS |
Supported Providers (STT-LLM-TTS mode)¶
Configured via voice_config.llm_model:
| Provider | Models | Notes |
|---|---|---|
google |
gemini-2.5-flash, gemini-3-flash |
Default; tool calling, vision capable |
openai |
gpt-4o, gpt-4o-mini |
Tool calling, structured output |
| Custom endpoint | Any OpenAI-compatible API | HuggingFace, Together AI, Groq |
Supported Providers (Realtime mode)¶
| Provider | Models | Notes |
|---|---|---|
google |
Gemini Live (gemini-2.5-flash-live-preview) |
Audio-native, low latency |
openai |
OpenAI Realtime API | Audio-native |
qwen |
Qwen Omni | Audio-native |
Input / Output¶
Input to LLM (per turn):
ChatContext:
[system] assembled agent_prompt (from Compass prompt components)
[user] STT transcript of current utterance
[history] previous assistant + user messages (sliding window)
[tools] registered tool definitions from agent_skills
Output from LLM:
- Plain response text → passed to TTS Service
- Tool call request → Atlas executes tool, result fed back to LLM, then final response to TTS
- Escalation signal → Atlas triggers
transfer_to_human_agent
Prompt Assembly¶
The LLM system message is assembled by Atlas from the prompt components returned by Compass, sorted by priority:
| Priority | Component | Content |
|---|---|---|
| 10–30 | base |
Core agent identity |
| 11 | base_prompt_voice |
Voice channel rules |
| 60 | language |
Per-language conversation style |
| 90 | escalation |
Human handoff rules |
| 100 | guard |
Hallucination / safety guardrails |
| 110+ | tool_prompt |
Per-tool usage instructions |
For realtime mode, Compass returns a pre-assembled prompt_realtime string that is injected immutably at session start.
Tool Calling¶
Atlas registers active agent skills (from Compass agent_skills) as RawFunctionTools available to the LLM each turn. Execution is handled by MCPManager:
- LLM emits a tool call request
- Atlas routes to the relevant MCP server or local tool
- Tool result is injected back into context as a tool response message
- LLM generates the final response
Local tools always available:
| Tool | Purpose |
|---|---|
transfer_to_human_agent |
Escalate to human agent queue |
switch_language |
Change conversation language mid-call |
end_call |
Gracefully terminate the call |
search_knowledge |
Query agent's Qdrant knowledge base |
MCPManager deduplicates tool calls within a 30 s window to prevent concurrent duplicate executions.
Guardrails¶
- Pre-generation: Prompt validation applied at agent creation time in Compass (malicious content check)
- Post-generation:
guardprompt components instruct the model to avoid hallucinations and enforce tone policy - Escalation: If the model or the
guardcomponent determines the query cannot be handled, it emits an escalation signal to triggertransfer_to_human_agent
Observability¶
- All tool inputs and outputs logged at DEBUG level with
call_idcontext - LangFuse / OpenTelemetry traces: session ID =
call_id, user = caller phone number - Recoverable LLM errors: Atlas speaks a filler phrase while the SDK retries internally