Skip to content

LLM Service

LLM inference in Nexivo is implemented inside Atlas (~/PycharmProjects/AI/atlas/src/voice_pipeline). There is no separate LLM microservice — Atlas manages the full conversation context and calls the configured LLM provider directly.

The LLM provider is configured per agent via voice_config.llm_model and voice_config.pipeline_mode in Compass.


Pipeline Modes

Atlas supports two pipeline modes selected per agent:

Mode Description
stt_llm_tts Standard pipeline — STT transcript → LLM text → TTS audio
realtime Audio-native model (e.g. Gemini Live, OpenAI Realtime) — audio in, audio out; no separate STT/TTS

Supported Providers (STT-LLM-TTS mode)

Configured via voice_config.llm_model:

Provider Models Notes
google gemini-2.5-flash, gemini-3-flash Default; tool calling, vision capable
openai gpt-4o, gpt-4o-mini Tool calling, structured output
Custom endpoint Any OpenAI-compatible API HuggingFace, Together AI, Groq

Supported Providers (Realtime mode)

Provider Models Notes
google Gemini Live (gemini-2.5-flash-live-preview) Audio-native, low latency
openai OpenAI Realtime API Audio-native
qwen Qwen Omni Audio-native

Input / Output

Input to LLM (per turn):

ChatContext:
  [system]   assembled agent_prompt (from Compass prompt components)
  [user]     STT transcript of current utterance
  [history]  previous assistant + user messages (sliding window)
  [tools]    registered tool definitions from agent_skills

Output from LLM:

  • Plain response text → passed to TTS Service
  • Tool call request → Atlas executes tool, result fed back to LLM, then final response to TTS
  • Escalation signal → Atlas triggers transfer_to_human_agent

Prompt Assembly

The LLM system message is assembled by Atlas from the prompt components returned by Compass, sorted by priority:

Priority Component Content
10–30 base Core agent identity
11 base_prompt_voice Voice channel rules
60 language Per-language conversation style
90 escalation Human handoff rules
100 guard Hallucination / safety guardrails
110+ tool_prompt Per-tool usage instructions

For realtime mode, Compass returns a pre-assembled prompt_realtime string that is injected immutably at session start.


Tool Calling

Atlas registers active agent skills (from Compass agent_skills) as RawFunctionTools available to the LLM each turn. Execution is handled by MCPManager:

  1. LLM emits a tool call request
  2. Atlas routes to the relevant MCP server or local tool
  3. Tool result is injected back into context as a tool response message
  4. LLM generates the final response

Local tools always available:

Tool Purpose
transfer_to_human_agent Escalate to human agent queue
switch_language Change conversation language mid-call
end_call Gracefully terminate the call
search_knowledge Query agent's Qdrant knowledge base

MCPManager deduplicates tool calls within a 30 s window to prevent concurrent duplicate executions.


Guardrails

  • Pre-generation: Prompt validation applied at agent creation time in Compass (malicious content check)
  • Post-generation: guard prompt components instruct the model to avoid hallucinations and enforce tone policy
  • Escalation: If the model or the guard component determines the query cannot be handled, it emits an escalation signal to trigger transfer_to_human_agent

Observability

  • All tool inputs and outputs logged at DEBUG level with call_id context
  • LangFuse / OpenTelemetry traces: session ID = call_id, user = caller phone number
  • Recoverable LLM errors: Atlas speaks a filler phrase while the SDK retries internally