Herald: Text AI Runtime
Herald is a stateless, channel-agnostic AI runtime for text-based customer interactions. It processes conversations in discrete turns: accept one HTTP request containing a user message, run an AI agent turn (with optional MCP tools and knowledge base), and stream typed reply events back to the caller's webhook. The caller — typically the WhatsApp Service or a chat adapter — translates those events into whatever transport-specific actions it needs.
Herald is the text-channel equivalent of Atlas. The same Compass agent configuration powers both; Herald handles WhatsApp, SMS, and web chat while Atlas handles voice calls.
Source
PycharmProjects/AI/herald — Python 3.11 / FastAPI
Tech Stack
| Layer | Technology |
|---|---|
| Framework | FastAPI 0.115 + Uvicorn 0.32 |
| Python | 3.11 |
| Data validation | Pydantic v2 + pydantic-settings |
| HTTP client | httpx (async) |
| LLM providers | OpenAI SDK 1.76 · google-genai 1.0 · Azure OpenAI |
| Agent tools | MCP 1.2 (Model Context Protocol, streamable-HTTP) |
| Geocoding | Google Maps API · OSM Nominatim (fallback) |
| Observability | loguru · Langfuse 2.x (optional tracing) |
Architecture
graph TD
Caller["Caller\n(WhatsApp Service / Chat Adapter)"]
subgraph Herald
API["POST /turn\n(FastAPI)"]
DISP["TurnDispatcher\n(semaphore + per-conv locks)"]
EXEC["TurnExecutor\n(orchestrator)"]
LLM["LlmClient\n(OpenAI / Azure / Gemini)"]
PB["PromptBuilder"]
BTOOL["Builtin Tools\n(KB, transfer, language…)"]
MCP["McpClient\n(agent MCP server)"]
EE["EventEmitter"]
CB["CallbackClient\n(retries + idempotency)"]
end
subgraph External
COMPASS["Compass\n(agent config + KB)"]
MCPSRV["Agent MCP Server\n(per-agent)"]
LLMPROV["LLM Provider\n(OpenAI / Azure / Gemini)"]
MAPS["Geocoding\n(Maps API / OSM)"]
WEBHOOK["Caller Webhook"]
end
Caller -->|POST /turn| API
API --> DISP
DISP --> EXEC
EXEC -->|get_agent / search_KB| COMPASS
EXEC --> LLM --> LLMPROV
EXEC --> PB
EXEC --> BTOOL --> MAPS
EXEC --> MCP --> MCPSRV
EXEC --> EE --> CB --> WEBHOOK
On startup the app creates shared CompassClient, CallbackClient, MapsClient, TurnExecutor, and TurnDispatcher instances, attaching them to app.state. On shutdown the dispatcher drains inflight turns within the configured grace period.
Turn Lifecycle
sequenceDiagram
participant Caller
participant Herald
participant Compass
participant LLM
participant MCP
Caller->>Herald: POST /turn {agent_id, message, history, callback_url}
Herald-->>Caller: 202 Accepted {turn_id}
Herald->>Compass: GET agent config
Herald->>Caller: typing(on) → callback_url
alt MCP server configured
Herald->>MCP: list_tools()
loop tool-call loop (max_iterations)
Herald->>LLM: chat(messages, tools)
LLM-->>Herald: tool calls or final text
Herald->>MCP: call_tool(name, args)
end
else KB-only
Herald->>LLM: chat(messages, builtin_tools)
LLM-->>Herald: final text
end
Herald->>Caller: typing(off) → callback_url
Herald->>Caller: message(text, final=true) → callback_url
Herald->>Caller: done(latency_ms, escalate, resolved…) → callback_url
Per-conversation ordering is enforced by ConversationLockMap — two messages from the same conversation_id never process in parallel. Global LLM concurrency is bounded by a semaphore (MAX_CONCURRENT_TURNS).
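The ordering and concurrency rules above can be sketched with plain asyncio primitives. This is a minimal illustration, not Herald's actual code: the lock_for method, the run_turn helper, and the choice to take the conversation lock before the semaphore are all assumptions made for the example.

```python
import asyncio
from collections import defaultdict


class ConversationLockMap:
    """One asyncio.Lock per conversation_id (hypothetical minimal version)."""

    def __init__(self) -> None:
        self._locks = defaultdict(asyncio.Lock)

    def lock_for(self, conversation_id: str) -> asyncio.Lock:
        return self._locks[conversation_id]


async def run_turn(locks, semaphore, conversation_id, message, log):
    # Turns for the same conversation queue on this lock in arrival order.
    async with locks.lock_for(conversation_id):
        # The semaphore bounds how many turns run at once across conversations.
        async with semaphore:
            log.append(("start", conversation_id, message))
            await asyncio.sleep(0.01)  # stand-in for the LLM call
            log.append(("end", conversation_id, message))


async def demo():
    locks = ConversationLockMap()
    semaphore = asyncio.Semaphore(2)  # plays the role of MAX_CONCURRENT_TURNS
    log = []
    await asyncio.gather(
        run_turn(locks, semaphore, "c-1", "first", log),
        run_turn(locks, semaphore, "c-1", "second", log),
        run_turn(locks, semaphore, "c-2", "other", log),
    )
    return log
```

In this sketch the two c-1 turns run strictly one after the other, while the c-2 turn overlaps with the first c-1 turn. The map never evicts idle locks; a long-running service would need some cleanup policy.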
API
POST /turn: Submit a turn
Returns 202 Accepted immediately; all reply events are POSTed to callback_url.
Required header: x-tenantId
Request body:
{
"conversation_id": "c-123",
"agent_id": "<uuid>",
"message": "What are your opening hours?",
"history": [
{"role": "user", "text": "Hi"},
{"role": "assistant", "text": "Hello! How can I help?"}
],
"contact": {
"id": "cust-456",
"name": "Alice",
"phone": "+441234567890",
"language": "en-GB"
},
"new_conversation": false,
"callback_url": "https://wa-service.internal/whatsapp/messages/reply",
"channel": "whatsapp",
"language": "en-GB",
"transferable": true
}
Response: 202 Accepted with the turn_id in the body.
Returns 503 when the dispatcher is at capacity (MAX_INFLIGHT_TURNS) or shutting down.
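The admission rule behind the 503 can be sketched as follows. This is an assumption-laden illustration: the Dispatcher class, its submit and drain methods, and the shutting_down flag are hypothetical names, and a False return from submit stands for the point where the HTTP layer would answer 503.

```python
import asyncio
from typing import Awaitable, Callable


class Dispatcher:
    """Hypothetical admission control: cap in-flight turns, bound parallelism."""

    def __init__(self, max_concurrent: int, max_inflight: int) -> None:
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._max_inflight = max_inflight
        self._tasks = set()
        self.shutting_down = False

    def submit(self, turn: Callable[[], Awaitable[None]]) -> bool:
        """Return False (the caller would answer 503) when full or shutting down."""
        if self.shutting_down or len(self._tasks) >= self._max_inflight:
            return False
        task = asyncio.create_task(self._run(turn))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)  # free the slot when done
        return True

    async def _run(self, turn: Callable[[], Awaitable[None]]) -> None:
        # Accepted turns wait here until a concurrency slot is free.
        async with self._semaphore:
            await turn()

    async def drain(self, grace_seconds: float) -> None:
        """Stop accepting work and wait up to grace_seconds for in-flight turns."""
        self.shutting_down = True
        if self._tasks:
            await asyncio.wait(self._tasks, timeout=grace_seconds)
```

The two limits play different roles here: the semaphore only delays work that was already accepted, while the in-flight cap is the one that rejects new work.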
Callback events
All events are POSTed to callback_url with these headers:
X-Idempotency-Key: <turn_id>:<seq>
X-Turn-Id: <turn_id>
Authorization: Bearer <CALLBACK_SHARED_SECRET> # if secret is configured
Content-Type: application/json
seq increments monotonically per turn. Receivers can deduplicate on turn_id:seq.
Delivery uses exponential back-off (1 s → 2 s → 4 s), retrying on 5xx and transport errors. 4xx responses fail immediately.
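A minimal sketch of the header set and retry policy described above. The callback_headers and deliver functions are hypothetical, the HTTP call is injected as a send callable so the example stays dependency-free, and it assumes that a retry count of 3 means three retries after the first attempt. Statuses outside 2xx, 4xx, and 5xx are not covered by the text; this sketch treats them like 4xx.

```python
import asyncio
from typing import Awaitable, Callable, Optional


def callback_headers(turn_id: str, seq: int, secret: Optional[str] = None) -> dict:
    """Build the headers listed above; Authorization only when a secret is set."""
    headers = {
        "X-Idempotency-Key": f"{turn_id}:{seq}",
        "X-Turn-Id": turn_id,
        "Content-Type": "application/json",
    }
    if secret:
        headers["Authorization"] = f"Bearer {secret}"
    return headers


async def deliver(
    send: Callable[[], Awaitable[int]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bool:
    """Return True on 2xx; retry 5xx and transport errors with doubling delays."""
    for attempt in range(max_retries + 1):
        try:
            status = await send()
        except OSError:
            status = None  # transport error: treated as retryable
        if status is not None and status < 500:
            # 2xx succeeds; 4xx (and any other non-5xx status) fails immediately.
            return 200 <= status < 300
        if attempt < max_retries:
            await sleep(base_delay * 2**attempt)  # 1 s, 2 s, 4 s with the defaults
    return False
```

Injecting sleep makes the back-off schedule easy to check in tests without waiting in real time.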
typing
{"event": "typing", "turn_id": "…", "conversation_id": "…", "seq": 1, "state": "on"}
{"event": "typing", "state": "off", …}
message
{
"event": "message",
"turn_id": "…", "conversation_id": "…", "seq": 2,
"text": "Our hours are Monday–Friday, 9 am to 5 pm.",
"final": true,
"interim": false
}
interim: true marks a pre-tool acknowledgement ("Let me check that for you…") sent while a tool call is in progress. final: true marks the last message of the turn.
done
{
"event": "done",
"turn_id": "…", "conversation_id": "…", "seq": 3,
"latency_ms": 1240,
"escalate": false,
"escalate_reason": null,
"resolved": false,
"contact_updates": {"name": "Alice", "language": "en-GB"},
"summary": "User asked about opening hours."
}
escalate_reason values: agent_unavailable · llm_unavailable · timeout · tool_loop_limit · internal_error
resolved: true when the end_conversation tool was called.
contact_updates carries any contact fields the LLM captured during the turn (name, phone, email, language) so the caller can persist them.
error
Fatal failure; no done event follows.
location_request
{"event": "location_request", "text": "Please share your location so I can find your nearest branch.", …}
GET /health
{
"status": "ok",
"dispatcher": {
"inflight": 5,
"active_conversations": 3,
"max_concurrent": 100,
"max_inflight": 1000
}
}
Used by Kubernetes liveness and readiness probes.
GET /agents/{agent_id}/greeting
Returns the agent's greeting text for a given language.
Required header: x-tenantId
Optional query param: language_code (BCP-47, e.g. en-GB, es, fr-FR)
{
"agent_id": "…",
"language_code": "en-GB",
"greeting": "Hi! Welcome to Kings Dental Center. How can I help you today?"
}
The greeting is resolved from the agent's base_greeting field in Compass — either a string (single language) or a {language_code: text} dict.
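One way the two shapes of base_greeting could be resolved is sketched below. The resolve_greeting function is hypothetical, and the fallback order (exact code, then primary language subtag, then the first configured greeting) is an assumption; the source only states that the field is a string or a dict.

```python
from typing import Optional, Union

Greeting = Union[str, dict]


def resolve_greeting(base_greeting: Greeting, language_code: Optional[str] = None) -> Optional[str]:
    """Pick a greeting from a plain string or a {language_code: text} dict."""
    if isinstance(base_greeting, str):
        return base_greeting  # single-language agent: same text for everyone
    if language_code:
        if language_code in base_greeting:
            return base_greeting[language_code]  # exact match, e.g. "en-GB"
        primary = language_code.split("-")[0]
        if primary in base_greeting:
            return base_greeting[primary]  # "es-MX" falls back to "es"
    # Assumed last resort: the first configured greeting, or None if empty.
    return next(iter(base_greeting.values()), None)
```

For example, resolve_greeting({"en-GB": "Hello", "es": "Hola"}, "es-MX") returns "Hola" under these assumptions.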
Built-in Tools
Herald provides a set of agent-neutral function tools that the LLM can call regardless of which MCP server (if any) the agent uses:
| Tool | Action |
|---|---|
| search_knowledge | Semantic KB search via Compass; top-N chunks injected into context |
| transfer_to_human_agent | Raise TransferRequested; triggers escalate: true in done event |
| switch_language | Acknowledge language switch in reply |
| report_user_language | Signal detected language; recorded in contact_updates |
| report_clarification_failure | Signal repeated failure to understand input |
| end_conversation | Raise ConversationEnded; triggers resolved: true in done event |
| find_nearest_branch | Geocode user address, calculate distances, return nearest location |
| capture_contact_info | Store name/phone/email/language from user; included in done.contact_updates |
| request_location | Raise LocationRequested; emits location_request event |
When an MCP server is configured, its tools are merged with the built-ins. On MCP session failure, Herald falls back to built-ins only.
MCP Tool Integration
Each agent in Compass can have an mcp_server_url. On each turn:
- Herald opens a streamable-HTTP MCP session to that URL.
- list_tools() returns the agent-specific tool catalogue (e.g. appointment booking, order lookup, policy query).
- These are merged with the built-in tools and passed to the LLM.
- The LLM issues tool calls; Herald dispatches each to the MCP server via call_tool(name, args).
- Results are appended to the message list and the LLM is called again (up to MCP_MAX_ITERATIONS).
If the MCP session can't be opened, Herald logs a warning and continues with the built-in tools only.
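The steps above can be sketched as a small loop. Everything here is a simplified stand-in rather than the real MCP SDK or Herald code: run_tool_loop, the reply dict shape with a tool_call key, and the open_mcp_tools callable are hypothetical, and letting MCP tools override built-ins on a name clash is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_tool_loop(
    llm: Callable[[list, list], Awaitable[dict]],
    builtin_tools: dict,
    open_mcp_tools: Callable[[], Awaitable[dict]],
    messages: list,
    max_iterations: int = 5,
) -> Optional[str]:
    """Return the final text, or None when the iteration limit is reached."""
    tools = dict(builtin_tools)
    try:
        # Stand-in for opening the session and calling list_tools().
        tools.update(await open_mcp_tools())
    except ConnectionError:
        pass  # MCP unavailable: continue with the built-in tools only

    for _ in range(max_iterations):
        reply = await llm(messages, sorted(tools))
        if "tool_call" not in reply:
            return reply["text"]  # the model produced its final answer
        name, args = reply["tool_call"]
        tool = tools.get(name)
        result = await tool(args) if tool else f"unknown tool: {name}"
        # Feed the tool result back so the next LLM call can use it.
        messages.append({"role": "tool", "name": name, "text": result})
    return None  # a caller might escalate with tool_loop_limit here
```

Returning None at the limit keeps the loop itself free of escalation policy; mapping that outcome to an escalate_reason is left to the caller in this sketch.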
LLM Providers
The active provider is selected by DEFAULT_LLM_PROVIDER. All providers support text completion and function calling.
| Provider | Default model | Notes |
|---|---|---|
| openai | gpt-4o-mini | Native OpenAI SDK |
| azure | gpt-4 | Custom deployment + API version |
| google | gemini-2.5-flash | Native google-genai SDK; OpenAI-compat fallback |
Per-agent model override: if AgentConfig.model is set in Compass, that model is used for that agent's turns regardless of the global default.
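The override rule amounts to a simple precedence check, sketched here. The select_model function and the DEFAULT_MODELS mapping are hypothetical; the default values are the ones in the provider table, and for azure the value is a deployment name rather than a model name.

```python
from typing import Optional

# Defaults taken from the provider table; the mapping itself is illustrative.
DEFAULT_MODELS = {
    "openai": "gpt-4o-mini",
    "azure": "gpt-4",
    "google": "gemini-2.5-flash",
}


def select_model(provider: str, agent_model: Optional[str] = None) -> str:
    """The agent-level model wins; otherwise use the provider default."""
    if agent_model:
        return agent_model
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"unknown provider: {provider}")
    return DEFAULT_MODELS[provider]
```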
Langfuse tracing (optional): when LANGFUSE_TRACING=true, all LLM calls are captured in Langfuse for cost tracking and prompt evaluation.
Compass Integration
Herald calls up to two Compass endpoints per turn:
| Call | When | On failure |
|---|---|---|
| GET /tenants/agents/{agent_id} | Start of every turn | CompassError → escalate turn |
| POST /knowledge/search | When search_knowledge tool is called | Return empty results (graceful degradation) |
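The graceful-degradation path for KB search might look like the sketch below. It assumes chunks arrive as dicts with text and score keys, and that the score threshold and limit are applied on the Herald side; neither detail is confirmed by the source. The search callable stands in for the POST /knowledge/search request, and CompassError is redefined locally so the example runs on its own.

```python
from typing import Callable


class CompassError(Exception):
    """Local stand-in for the error raised when a Compass call fails."""


def search_knowledge(
    search: Callable[[str], list],
    query: str,
    score_threshold: float = 0.5,
    limit: int = 6,
) -> list:
    """Return the best-scoring chunks, or an empty list if Compass fails."""
    try:
        chunks = search(query)
    except CompassError:
        return []  # degrade to "no results" instead of failing the turn
    # Keep only chunks at or above the threshold, best first, capped at limit.
    relevant = [c for c in chunks if c["score"] >= score_threshold]
    relevant.sort(key=lambda c: c["score"], reverse=True)
    return relevant[:limit]
```

With an empty result the LLM still gets to answer, which is what keeps the turn alive.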
The AgentConfig from Compass drives:
- System prompt (agent_prompt)
- Hallucination guard prompt
- MCP server URL
- Supported languages
- Base greeting
- Tool-specific prompt fragments (tool_prompts)
Herald vs Atlas
Both services use the same Compass agent configuration. The same agent record powers both channels simultaneously.
| | Atlas | Herald |
|---|---|---|
| Channel | Voice (phone calls) | Text (WhatsApp, SMS, web chat) |
| Processing | Real-time streaming, call state machine | Async request-reply, stateless turns |
| Server state | Call-stateful (active calls in memory) | Stateless (each turn is independent) |
| Format guidance | Voice prosody, SSML hints | WhatsApp markdown rules, [SPLIT] message splitting |
| Location tool | Not applicable | request_location → location_request event |
| Greeting delivery | Immediate TTS on call connect | Text event; caller decides timing |
Configuration Reference
| Variable | Default | Purpose |
|---|---|---|
| SERVICE_PORT | 8090 | Uvicorn listen port |
| LOG_LEVEL | INFO | Logging level |
| COMPASS_URL | (required) | Compass base URL |
| COMPASS_TIMEOUT | 10.0 | Compass request timeout (seconds) |
| DEFAULT_LLM_PROVIDER | openai | Active LLM provider: openai · azure · google |
| OPENAI_API_KEY | (none) | OpenAI API key |
| OPENAI_DEFAULT_MODEL | gpt-4o-mini | Default OpenAI model |
| GOOGLE_API_KEY | (none) | Google Gemini API key |
| GOOGLE_DEFAULT_MODEL | gemini-2.5-flash | Default Gemini model |
| GOOGLE_MAPS_API_KEY | (none) | Google Maps geocoding key (falls back to OSM if absent) |
| AZURE_OPENAI_ENDPOINT | (none) | Azure OpenAI endpoint URL |
| AZURE_OPENAI_API_KEY | (none) | Azure OpenAI API key |
| AZURE_OPENAI_API_VERSION | 2025-02-01-preview | Azure API version |
| AZURE_OPENAI_DEPLOYMENT | gpt-4 | Azure deployment name |
| KB_ENABLED | true | Enable search_knowledge tool |
| KB_SCORE_THRESHOLD | 0.5 | Minimum relevance score for KB chunks |
| KB_LIMIT | 6 | Max KB chunks to include in context |
| HISTORY_MAX_TURNS | 20 | Conversation history window |
| LLM_MAX_TOKENS | 1024 | Maximum output tokens |
| LLM_TEMPERATURE | 0.3 | LLM temperature |
| MCP_ENABLED | true | Enable agent MCP tools |
| MCP_MAX_ITERATIONS | 5 | Maximum tool-call loop iterations per turn |
| MAX_CONCURRENT_TURNS | 100 | Parallel LLM executions (semaphore) |
| MAX_INFLIGHT_TURNS | 1000 | Total queued + running tasks before 503 |
| TURN_TIMEOUT_SECONDS | 60.0 | Per-turn deadline |
| SHUTDOWN_DRAIN_SECONDS | 30.0 | Graceful shutdown drain window |
| CALLBACK_TIMEOUT_SECONDS | 5.0 | Per-callback-request timeout |
| CALLBACK_MAX_RETRIES | 3 | Exponential back-off retry count |
| CALLBACK_SHARED_SECRET | (none) | Bearer token added to callback Authorization header |
| LANGFUSE_TRACING | false | Enable Langfuse LLM tracing |
| LANGFUSE_PUBLIC_KEY | (none) | Langfuse public key |
| LANGFUSE_SECRET_KEY | (none) | Langfuse secret key |
| LANGFUSE_HOST | http://localhost:3000 | Langfuse host |
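To illustrate how a few of these variables map onto typed settings, here is a standard-library sketch. The tech stack lists pydantic-settings for this job, so the dataclass below is a dependency-free stand-in rather than the real settings class; the Settings and load_settings names and the set of accepted boolean spellings are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """A hypothetical subset of the variables in the table above."""

    compass_url: str
    service_port: int = 8090
    max_concurrent_turns: int = 100
    max_inflight_turns: int = 1000
    turn_timeout_seconds: float = 60.0
    kb_enabled: bool = True
    callback_shared_secret: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, applying the table defaults."""
    if "COMPASS_URL" not in env:
        raise ValueError("COMPASS_URL is required")
    return Settings(
        compass_url=env["COMPASS_URL"],
        service_port=int(env.get("SERVICE_PORT", "8090")),
        max_concurrent_turns=int(env.get("MAX_CONCURRENT_TURNS", "100")),
        max_inflight_turns=int(env.get("MAX_INFLIGHT_TURNS", "1000")),
        turn_timeout_seconds=float(env.get("TURN_TIMEOUT_SECONDS", "60.0")),
        # Assumed truthy spellings; the real parser may accept others.
        kb_enabled=env.get("KB_ENABLED", "true").lower() in {"1", "true", "yes"},
        # An empty secret is treated as "not configured".
        callback_shared_secret=env.get("CALLBACK_SHARED_SECRET") or None,
    )
```

Passing the environment as a mapping keeps the loader easy to exercise in tests without touching the process environment.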
Key Design Decisions
Stateless turns: Herald holds no conversation state between turns. All context (history, contact) is passed in the request, so the service scales horizontally without sticky sessions.
Per-conversation ordering: ConversationLockMap uses asyncio locks keyed by conversation_id. If two messages from the same conversation arrive before the first turn completes, the second queues behind the first. This prevents out-of-order replies without requiring a message broker.
Bounded concurrency: A global semaphore caps parallel LLM calls at MAX_CONCURRENT_TURNS; turns beyond that cap wait in the queue. Total queued and running turns are capped at MAX_INFLIGHT_TURNS, and Herald returns 503 once that cap is reached so the caller can apply back-pressure rather than overload the LLM API.
Graceful degradation: MCP failure falls back to the built-in tools only. KB search failure returns empty results. Both keep the turn alive rather than failing hard.
Idempotent callbacks: Every event carries turn_id:seq as an idempotency key. Receivers that store this key can safely tolerate re-delivery without duplicating messages.
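On the receiving side, deduplication on the idempotency key can be as small as the sketch below. The CallbackReceiver class and its outbox list are hypothetical; an in-memory set is used for brevity, whereas a real receiver would more likely keep seen keys in a shared store with an expiry.

```python
class CallbackReceiver:
    """Hypothetical webhook handler that ignores re-delivered events."""

    def __init__(self) -> None:
        self._seen = set()   # idempotency keys already processed
        self.outbox = []     # texts that would be sent on to the end user

    def handle(self, idempotency_key: str, event: dict) -> bool:
        """Process an event once; return False when it is a duplicate."""
        if idempotency_key in self._seen:
            return False  # retry of an event we already handled
        self._seen.add(idempotency_key)
        if event["event"] == "message":
            self.outbox.append(event["text"])
        return True
```

A duplicate should still be acknowledged with a success status so that the sender stops retrying; only the side effect is skipped.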