Herald: Text AI Runtime

Herald is a stateless, channel-agnostic AI runtime for text-based customer interactions. It processes conversations in discrete turns: accept one HTTP request containing a user message, run an AI agent turn (with optional MCP tools and knowledge base), and stream typed reply events back to the caller's webhook. The caller — typically the WhatsApp Service or a chat adapter — translates those events into whatever transport-specific actions it needs.

Herald is the text-channel equivalent of Atlas. The same Compass agent configuration powers both; Herald handles WhatsApp, SMS, and web chat while Atlas handles voice calls.

Source

PycharmProjects/AI/herald — Python 3.11 / FastAPI

Tech Stack

Layer	Technology
Framework	FastAPI 0.115 + Uvicorn 0.32
Python	3.11
Data validation	Pydantic v2 + pydantic-settings
HTTP client	httpx (async)
LLM providers	OpenAI SDK 1.76 · google-genai 1.0 · Azure OpenAI
Agent tools	MCP 1.2 (Model Context Protocol, streamable-HTTP)
Geocoding	Google Maps API · OSM Nominatim (fallback)
Observability	loguru · Langfuse 2.x (optional tracing)

Architecture

graph TD
    Caller["Caller\n(WhatsApp Service / Chat Adapter)"]

    subgraph Herald
        API["POST /turn\n(FastAPI)"]
        DISP["TurnDispatcher\n(semaphore + per-conv locks)"]
        EXEC["TurnExecutor\n(orchestrator)"]
        LLM["LlmClient\n(OpenAI / Azure / Gemini)"]
        PB["PromptBuilder"]
        BTOOL["Builtin Tools\n(KB, transfer, language…)"]
        MCP["McpClient\n(agent MCP server)"]
        EE["EventEmitter"]
        CB["CallbackClient\n(retries + idempotency)"]
    end

    subgraph External
        COMPASS["Compass\n(agent config + KB)"]
        MCPSRV["Agent MCP Server\n(per-agent)"]
        LLMPROV["LLM Provider\n(OpenAI / Azure / Gemini)"]
        MAPS["Geocoding\n(Maps API / OSM)"]
        WEBHOOK["Caller Webhook"]
    end

    Caller -->|POST /turn| API
    API --> DISP
    DISP --> EXEC
    EXEC -->|get_agent / search_KB| COMPASS
    EXEC --> LLM --> LLMPROV
    EXEC --> PB
    EXEC --> BTOOL --> MAPS
    EXEC --> MCP --> MCPSRV
    EXEC --> EE --> CB --> WEBHOOK

On startup the app creates shared CompassClient, CallbackClient, MapsClient, TurnExecutor, and TurnDispatcher instances, attaching them to app.state. On shutdown the dispatcher drains inflight turns within the configured grace period.

Turn Lifecycle

sequenceDiagram
    participant Caller
    participant Herald
    participant Compass
    participant LLM
    participant MCP

    Caller->>Herald: POST /turn {agent_id, message, history, callback_url}
    Herald-->>Caller: 202 Accepted {turn_id}

    Herald->>Compass: GET agent config
    Herald->>Caller: typing(on) → callback_url

    alt MCP server configured
        Herald->>MCP: list_tools()
        loop tool-call loop (max_iterations)
            Herald->>LLM: chat(messages, tools)
            LLM-->>Herald: tool calls or final text
            Herald->>MCP: call_tool(name, args)
        end
    else KB-only
        Herald->>LLM: chat(messages, builtin_tools)
        LLM-->>Herald: final text
    end

    Herald->>Caller: typing(off) → callback_url
    Herald->>Caller: message(text, final=true) → callback_url
    Herald->>Caller: done(latency_ms, escalate, resolved…) → callback_url

Per-conversation ordering is enforced by ConversationLockMap — two messages from the same conversation_id never process in parallel. Global LLM concurrency is bounded by a semaphore (MAX_CONCURRENT_TURNS).
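The interplay of the two limits can be illustrated with plain asyncio primitives. This is a minimal sketch, not Herald's code: `ConversationLocks`, `run_turn`, and `demo` are hypothetical names, the sleep stands in for the LLM call, and acquiring the conversation lock before the semaphore is an assumption about ordering.

```python
import asyncio
from collections import defaultdict
from typing import DefaultDict, List


class ConversationLocks:
    """Hypothetical stand-in for ConversationLockMap: one asyncio.Lock per conversation_id."""

    def __init__(self) -> None:
        self._locks: DefaultDict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    def lock_for(self, conversation_id: str) -> asyncio.Lock:
        return self._locks[conversation_id]


async def run_turn(
    locks: ConversationLocks,
    semaphore: asyncio.Semaphore,
    conversation_id: str,
    label: str,
    log: List[str],
) -> None:
    # Take the conversation lock first so a queued turn does not occupy a
    # global slot while it waits behind an earlier turn of the same conversation.
    async with locks.lock_for(conversation_id):
        async with semaphore:
            log.append(f"start {label}")
            await asyncio.sleep(0.01)  # stands in for the LLM call
            log.append(f"end {label}")


async def demo() -> List[str]:
    locks = ConversationLocks()
    semaphore = asyncio.Semaphore(2)  # plays the role of MAX_CONCURRENT_TURNS
    log: List[str] = []
    await asyncio.gather(
        run_turn(locks, semaphore, "c-1", "c-1/a", log),
        run_turn(locks, semaphore, "c-1", "c-1/b", log),
        run_turn(locks, semaphore, "c-2", "c-2/a", log),
    )
    return log


log = asyncio.run(demo())
```

In this sketch the second `c-1` turn starts only after the first one ends, while the `c-2` turn overlaps with it. A lock map like this grows with every new conversation, so a real implementation would need some form of eviction.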

API

`POST /turn`: Submit a turn

Returns 202 Accepted immediately; all reply events are POSTed to callback_url.

Required header: x-tenantId

Request body:

{
  "conversation_id": "c-123",
  "agent_id": "<uuid>",
  "message": "What are your opening hours?",
  "history": [
    {"role": "user",      "text": "Hi"},
    {"role": "assistant", "text": "Hello! How can I help?"}
  ],
  "contact": {
    "id": "cust-456",
    "name": "Alice",
    "phone": "+441234567890",
    "language": "en-GB"
  },
  "new_conversation": false,
  "callback_url": "https://wa-service.internal/whatsapp/messages/reply",
  "channel": "whatsapp",
  "language": "en-GB",
  "transferable": true
}

Response:

{"turn_id": "a1b2c3d4...", "status": "accepted"}

Returns 503 Service Unavailable when the dispatcher is at capacity (MAX_INFLIGHT_TURNS) or shutting down.
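From the caller's side, submitting a turn might look like the sketch below. It only builds the request with the standard library and does not send it. The host name, tenant id, and helper names are illustrative assumptions; the port is the documented SERVICE_PORT default.

```python
import json
import urllib.request


def build_turn_request(base_url: str, tenant_id: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /turn request with the required tenant header."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/turn",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-tenantId": tenant_id},
        method="POST",
    )


def is_capacity_rejection(status_code: int) -> bool:
    """503 signals that Herald is at capacity or draining, so the caller may retry later."""
    return status_code == 503


request = build_turn_request(
    "http://herald.internal:8090",  # hypothetical host; 8090 is the default SERVICE_PORT
    "tenant-1",                     # hypothetical tenant id
    {
        "conversation_id": "c-123",
        "agent_id": "<uuid>",
        "message": "What are your opening hours?",
        "history": [],
        "callback_url": "https://wa-service.internal/whatsapp/messages/reply",
        "channel": "whatsapp",
    },
)
```

Passing the result to `urllib.request.urlopen(request)` would send it. A 202 response carries the `turn_id` that later appears in every callback event.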

Callback events

All events are POSTed to callback_url with these headers:

X-Idempotency-Key: <turn_id>:<seq>
X-Turn-Id: <turn_id>
Authorization: Bearer <CALLBACK_SHARED_SECRET>   # if secret is configured
Content-Type: application/json

seq increments monotonically per turn. Receivers can deduplicate on turn_id:seq.

Delivery uses exponential back-off (1 s → 2 s → 4 s), retrying on 5xx and transport errors. 4xx responses fail immediately.
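The header set and retry policy can be sketched as follows. This is an illustration under assumptions rather than the CallbackClient itself: `send` is a hypothetical callable that returns an HTTP status or raises `ConnectionError`, any status below 400 is treated as delivered, and CALLBACK_MAX_RETRIES is read as the number of retries after the first attempt.

```python
import time
from typing import Callable, Dict, Optional


class CallbackFailed(Exception):
    """Raised when an event cannot be delivered."""


def callback_headers(turn_id: str, seq: int, secret: Optional[str] = None) -> Dict[str, str]:
    headers = {
        "X-Idempotency-Key": f"{turn_id}:{seq}",
        "X-Turn-Id": turn_id,
        "Content-Type": "application/json",
    }
    if secret:  # the Authorization header is only sent when a secret is configured
        headers["Authorization"] = f"Bearer {secret}"
    return headers


def deliver(
    send: Callable[[], int],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds; return the number of attempts used."""
    for attempt in range(max_retries + 1):
        try:
            status: Optional[int] = send()
        except ConnectionError:
            status = None  # transport error: retryable
        if status is not None and status < 400:
            return attempt + 1
        if status is not None and status < 500:
            raise CallbackFailed(f"non-retryable status {status}")
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s with the defaults
    raise CallbackFailed("retries exhausted")
```

Injecting `sleep` keeps the function testable without real delays; in an async service the same logic would use `asyncio.sleep` and an async HTTP client.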

`typing`

{"event": "typing", "turn_id": "…", "conversation_id": "…", "seq": 1, "state": "on"}
{"event": "typing", "state": "off", …}

`message`

{
  "event": "message",
  "turn_id": "…", "conversation_id": "…", "seq": 2,
  "text": "Our hours are Monday–Friday, 9 am to 5 pm.",
  "final": true,
  "interim": false
}

interim: true marks a pre-tool acknowledgement ("Let me check that for you…") sent while a tool call is in progress. final: true marks the last message of the turn.

`done`

{
  "event": "done",
  "turn_id": "…", "conversation_id": "…", "seq": 3,
  "latency_ms": 1240,
  "escalate": false,
  "escalate_reason": null,
  "resolved": false,
  "contact_updates": {"name": "Alice", "language": "en-GB"},
  "summary": "User asked about opening hours."
}

escalate_reason values: agent_unavailable · llm_unavailable · timeout · tool_loop_limit · internal_error

resolved: true when the end_conversation tool was called.

contact_updates carries any contact fields the LLM captured during the turn (name, phone, email, language) so the caller can persist them.

`error`

{"event": "error", "reason": "Agent configuration not found", …}

Fatal failure; no done event follows.

`location_request`

{"event": "location_request", "text": "Please share your location so I can find your nearest branch.", …}
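Taken together, a receiver could translate the callback events above into channel actions along these lines. The action strings (`send_text`, `handover`, and so on) are placeholders invented for the sketch; a real adapter would call its transport API instead.

```python
from typing import List


def handle_event(event: dict) -> List[str]:
    """Map one callback event to a list of hypothetical channel actions."""
    kind = event.get("event")
    if kind == "typing":
        return [f"typing_indicator:{event['state']}"]
    if kind == "message":
        return [f"send_text:{event['text']}"]
    if kind == "location_request":
        return [f"send_location_prompt:{event['text']}"]
    if kind == "error":
        # Fatal failure: no done event follows, so the turn ends here.
        return [f"log_failure:{event['reason']}"]
    if kind == "done":
        actions = []
        if event.get("contact_updates"):
            actions.append("persist_contact")
        if event.get("escalate"):
            actions.append(f"handover:{event.get('escalate_reason')}")
        if event.get("resolved"):
            actions.append("close_conversation")
        return actions
    return []  # unknown event types are ignored in this sketch


def is_terminal(event: dict) -> bool:
    """A turn ends with either a done or an error event."""
    return event.get("event") in {"done", "error"}
```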

`GET /health`

{
  "status": "ok",
  "dispatcher": {
    "inflight": 5,
    "active_conversations": 3,
    "max_concurrent": 100,
    "max_inflight": 1000
  }
}

Used by Kubernetes liveness and readiness probes.

`GET /agents/{agent_id}/greeting`

Returns the agent's greeting text for a given language.

Required header: x-tenantId
Optional query param: language_code (BCP-47, e.g. en-GB, es, fr-FR)

{
  "agent_id": "…",
  "language_code": "en-GB",
  "greeting": "Hi! Welcome to Kings Dental Center. How can I help you today?"
}

The greeting is resolved from the agent's base_greeting field in Compass — either a string (single language) or a {language_code: text} dict.
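Resolving a greeting from either shape might work as sketched below. The fallback order (exact code, then primary language subtag, then any regional variant, then the first entry) is an assumption made for illustration; the source does not specify what happens when the requested language is missing.

```python
from typing import Dict, Optional, Union


def resolve_greeting(
    base_greeting: Union[str, Dict[str, str], None],
    language_code: Optional[str] = None,
) -> Optional[str]:
    """Pick a greeting from a plain string or a {language_code: text} dict."""
    if base_greeting is None:
        return None
    if isinstance(base_greeting, str):
        return base_greeting  # single-language agent: the code is irrelevant
    if language_code:
        if language_code in base_greeting:
            return base_greeting[language_code]
        primary = language_code.split("-")[0]
        if primary in base_greeting:
            return base_greeting[primary]
        for code, text in base_greeting.items():
            if code.split("-")[0] == primary:
                return text  # e.g. "en" requested, "en-GB" configured
    # Assumed last resort: the first configured greeting, if any.
    return next(iter(base_greeting.values()), None)
```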

Built-in Tools

Herald provides a set of agent-neutral function tools that the LLM can call regardless of which MCP server (if any) the agent uses:

Tool	Action
`search_knowledge`	Semantic KB search via Compass; top-N chunks injected into context
`transfer_to_human_agent`	Raise `TransferRequested`; triggers `escalate: true` in `done` event
`switch_language`	Acknowledge language switch in reply
`report_user_language`	Signal detected language; recorded in `contact_updates`
`report_clarification_failure`	Signal repeated failure to understand input
`end_conversation`	Raise `ConversationEnded`; triggers `resolved: true` in `done` event
`find_nearest_branch`	Geocode user address, calculate distances, return nearest location
`capture_contact_info`	Store name/phone/email/language from user; included in `done.contact_updates`
`request_location`	Raise `LocationRequested`; emits `location_request` event

When an MCP server is configured, its tools are merged with the built-ins. On MCP session failure, Herald falls back to built-ins only.

MCP Tool Integration

Each agent in Compass can have an mcp_server_url. On each turn:

1. Herald opens a streamable-HTTP MCP session to that URL.
2. list_tools() returns the agent-specific tool catalogue (e.g. appointment booking, order lookup, policy query).
3. These are merged with the built-in tools and passed to the LLM.
4. The LLM issues tool calls; Herald dispatches each to the MCP server via call_tool(name, args).
5. Results are appended to the message list and the LLM is called again (up to MCP_MAX_ITERATIONS).

If the MCP session can't be opened, Herald logs a warning and continues with the built-in tools only (the KB-only path in the turn lifecycle).
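The merge, fallback, and bounded loop described in this section can be sketched without any MCP or LLM dependency. Everything here is hypothetical scaffolding: `chat` is a stand-in callable returning either `tool_calls` or `text`, tools are plain Python callables, built-ins win on a name clash, and raising `ToolLoopLimit` when the cap is hit is an assumption suggested by the `tool_loop_limit` escalation reason.

```python
from typing import Callable, Dict, List

Tool = Callable[..., str]


class ToolLoopLimit(Exception):
    """Raised when the model keeps requesting tools past the iteration cap."""


def merge_tools(builtin_tools: Dict[str, Tool], list_mcp_tools: Callable[[], Dict[str, Tool]]) -> Dict[str, Tool]:
    try:
        mcp_tools = list_mcp_tools()
    except ConnectionError:
        mcp_tools = {}  # MCP unavailable: continue with built-ins only
    return {**mcp_tools, **builtin_tools}  # assumption: built-ins win on name clashes


def run_tool_loop(chat: Callable, tools: Dict[str, Tool], messages: List[dict], max_iterations: int = 5) -> str:
    for _ in range(max_iterations):
        reply = chat(messages, sorted(tools))
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply["text"]  # final answer
        for call in calls:
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool", "name": call["name"], "content": result})
    raise ToolLoopLimit(f"no final text after {max_iterations} iterations")


def scripted_chat(messages: List[dict], tool_names: List[str]) -> dict:
    """Fake model: asks for one tool call, then answers using its result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_calls": [{"name": "lookup_order", "args": {"order_id": "42"}}]}
    return {"text": "Status: " + messages[-1]["content"]}


tools = merge_tools(
    {"search_knowledge": lambda query: "no results"},
    lambda: {"lookup_order": lambda order_id: "shipped"},
)
answer = run_tool_loop(scripted_chat, tools, [{"role": "user", "content": "Where is order 42?"}])
```

In the real service the MCP tools would be dispatched over the session via `call_tool(name, args)` rather than called as local functions.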

LLM Providers

The active provider is selected by DEFAULT_LLM_PROVIDER. All providers support text completion and function calling.

Provider	Default model	Notes
`openai`	`gpt-4o-mini`	Native OpenAI SDK
`azure`	`gpt-4`	Custom deployment + API version
`google`	`gemini-2.5-flash`	Native google-genai SDK; OpenAI-compat fallback

Per-agent model override: if AgentConfig.model is set in Compass, that model is used for that agent's turns regardless of the global default.
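The precedence rule amounts to a small lookup. In this sketch `DEFAULT_MODELS` mirrors the defaults from the provider table, while `resolve_model` is a hypothetical helper name.

```python
from typing import Optional

# Provider defaults as listed in the table above.
DEFAULT_MODELS = {
    "openai": "gpt-4o-mini",
    "azure": "gpt-4",
    "google": "gemini-2.5-flash",
}


def resolve_model(provider: str, agent_model: Optional[str] = None) -> str:
    """Return the agent's own model if set, otherwise the provider default."""
    if agent_model:
        return agent_model
    return DEFAULT_MODELS[provider]
```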

Langfuse tracing (optional): when LANGFUSE_TRACING=true, all LLM calls are captured in Langfuse for cost tracking and prompt evaluation.

Compass Integration

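Herald depends on two Compass endpoints. The agent lookup runs at the start of every turn; the knowledge search runs only when the search_knowledge tool is called.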

Call	When	On failure
`GET /tenants/agents/{agent_id}`	Start of every turn	`CompassError` → escalate turn
`POST /knowledge/search`	When `search_knowledge` tool is called	Return empty results (graceful degradation)

The AgentConfig from Compass drives:

- System prompt (agent_prompt)
- Hallucination guard prompt
- MCP server URL
- Supported languages
- Base greeting
- Tool-specific prompt fragments (tool_prompts)

Herald vs Atlas

Both services read the same Compass agent record, so a single agent configuration can serve the voice and text channels at the same time.

	Atlas	Herald
Channel	Voice (phone calls)	Text (WhatsApp, SMS, web chat)
Processing	Real-time streaming, call state machine	Async request-reply, stateless turns
Server state	Call-stateful (active calls in memory)	Stateless (each turn is independent)
Format guidance	Voice prosody, SSML hints	WhatsApp markdown rules, `[SPLIT]` message splitting
Location tool	Not applicable	`request_location` → `location_request` event
Greeting delivery	Immediate TTS on call connect	Text event; caller decides timing

Configuration Reference

Variable	Default	Purpose
`SERVICE_PORT`	`8090`	Uvicorn listen port
`LOG_LEVEL`	`INFO`	Logging level
`COMPASS_URL`	(required)	Compass base URL
`COMPASS_TIMEOUT`	`10.0`	Compass request timeout (seconds)
`DEFAULT_LLM_PROVIDER`	`openai`	Active LLM provider: `openai` · `azure` · `google`
`OPENAI_API_KEY`	—	OpenAI API key
`OPENAI_DEFAULT_MODEL`	`gpt-4o-mini`	Default OpenAI model
`GOOGLE_API_KEY`	—	Google Gemini API key
`GOOGLE_DEFAULT_MODEL`	`gemini-2.5-flash`	Default Gemini model
`GOOGLE_MAPS_API_KEY`	—	Google Maps geocoding key (falls back to OSM if absent)
`AZURE_OPENAI_ENDPOINT`	—	Azure OpenAI endpoint URL
`AZURE_OPENAI_API_KEY`	—	Azure OpenAI API key
`AZURE_OPENAI_API_VERSION`	`2025-02-01-preview`	Azure API version
`AZURE_OPENAI_DEPLOYMENT`	`gpt-4`	Azure deployment name
`KB_ENABLED`	`true`	Enable `search_knowledge` tool
`KB_SCORE_THRESHOLD`	`0.5`	Minimum relevance score for KB chunks
`KB_LIMIT`	`6`	Max KB chunks to include in context
`HISTORY_MAX_TURNS`	`20`	Conversation history window
`LLM_MAX_TOKENS`	`1024`	Maximum output tokens
`LLM_TEMPERATURE`	`0.3`	LLM temperature
`MCP_ENABLED`	`true`	Enable agent MCP tools
`MCP_MAX_ITERATIONS`	`5`	Maximum tool-call loop iterations per turn
`MAX_CONCURRENT_TURNS`	`100`	Parallel LLM executions (semaphore)
`MAX_INFLIGHT_TURNS`	`1000`	Total queued + running tasks before 503
`TURN_TIMEOUT_SECONDS`	`60.0`	Per-turn deadline
`SHUTDOWN_DRAIN_SECONDS`	`30.0`	Graceful shutdown drain window
`CALLBACK_TIMEOUT_SECONDS`	`5.0`	Per-callback-request timeout
`CALLBACK_MAX_RETRIES`	`3`	Exponential back-off retry count
`CALLBACK_SHARED_SECRET`	—	Bearer token added to callback `Authorization` header
`LANGFUSE_TRACING`	`false`	Enable Langfuse LLM tracing
`LANGFUSE_PUBLIC_KEY`	—	Langfuse public key
`LANGFUSE_SECRET_KEY`	—	Langfuse secret key
`LANGFUSE_HOST`	`http://localhost:3000`	Langfuse host
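As a rough illustration of how a few of these variables map to typed settings, the sketch below reads them from an environment mapping using only the standard library. The tech stack lists pydantic-settings, so the real loader presumably looks different; the `Settings` class, `load_settings`, and the accepted boolean spellings are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    compass_url: str
    service_port: int = 8090
    max_concurrent_turns: int = 100
    max_inflight_turns: int = 1000
    turn_timeout_seconds: float = 60.0
    kb_enabled: bool = True


def _as_bool(value: str) -> bool:
    # Assumed spellings; the source only shows "true" and "false".
    return value.strip().lower() in {"1", "true", "yes"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read a subset of the documented variables, applying the documented defaults."""
    if "COMPASS_URL" not in env:
        raise KeyError("COMPASS_URL is required")
    return Settings(
        compass_url=env["COMPASS_URL"],
        service_port=int(env.get("SERVICE_PORT", "8090")),
        max_concurrent_turns=int(env.get("MAX_CONCURRENT_TURNS", "100")),
        max_inflight_turns=int(env.get("MAX_INFLIGHT_TURNS", "1000")),
        turn_timeout_seconds=float(env.get("TURN_TIMEOUT_SECONDS", "60.0")),
        kb_enabled=_as_bool(env.get("KB_ENABLED", "true")),
    )
```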

Key Design Decisions

Stateless turns — Herald holds no conversation state between turns. All context (history, contact) is passed in the request, making the service horizontally scalable without sticky sessions.

Per-conversation ordering — ConversationLockMap uses asyncio locks keyed by conversation_id. If two messages from the same conversation arrive before the first turn completes, the second queues behind the first. This prevents out-of-order replies without requiring a message broker.

Bounded concurrency: a global semaphore caps parallel LLM calls at MAX_CONCURRENT_TURNS, and turns beyond that limit wait in the queue. Queued plus running turns are capped at MAX_INFLIGHT_TURNS; once that cap is reached, POST /turn returns 503 so the caller can apply back-pressure rather than overload the LLM API.
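The in-flight cap could be enforced with a counter that is checked before a turn is queued. This is a minimal sketch with hypothetical names (`Admission`, `AtCapacity`); it assumes a single event loop, so no locking is needed around the counter.

```python
class AtCapacity(Exception):
    """Would be translated into an HTTP 503 by the API layer."""


class Admission:
    """Counts queued plus running turns and rejects new ones at the cap."""

    def __init__(self, max_inflight: int) -> None:
        self.max_inflight = max_inflight  # plays the role of MAX_INFLIGHT_TURNS
        self.inflight = 0
        self.shutting_down = False

    def admit(self) -> None:
        if self.shutting_down or self.inflight >= self.max_inflight:
            raise AtCapacity()
        self.inflight += 1

    def release(self) -> None:
        # Called when a turn finishes, whether it succeeded or failed.
        self.inflight -= 1
```

Rejecting at admission time keeps the queue bounded; the semaphore then only decides how many admitted turns run at once.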

Graceful degradation: an MCP failure falls back to the built-in tools, and a KB search failure returns empty results. Both keep the turn alive rather than failing hard.
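For the KB side, degradation can be as simple as swallowing the search error and returning no chunks. In this sketch `search` is a hypothetical callable wrapping the Compass knowledge search, the hit shape (`text`, `score`) is assumed, and the defaults mirror KB_SCORE_THRESHOLD and KB_LIMIT.

```python
from typing import Callable, List


def search_knowledge(
    search: Callable[[str], List[dict]],
    query: str,
    score_threshold: float = 0.5,
    limit: int = 6,
) -> List[dict]:
    """Return the best-scoring chunks, or an empty list if the search fails."""
    try:
        hits = search(query)
    except (ConnectionError, TimeoutError):
        return []  # degrade: the turn continues without KB context
    relevant = [hit for hit in hits if hit["score"] >= score_threshold]
    relevant.sort(key=lambda hit: hit["score"], reverse=True)
    return relevant[:limit]
```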

Idempotent callbacks — Every event carries turn_id:seq as an idempotency key. Receivers that store this key can safely tolerate re-delivery without duplicating messages.
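On the receiving side, deduplication needs little more than remembering the keys already processed. The sketch below keeps them in memory; `CallbackDeduplicator` is a hypothetical name, and a production receiver would more likely use a shared store with expiry.

```python
from typing import Set


class CallbackDeduplicator:
    """Tracks X-Idempotency-Key values (turn_id:seq) that were already handled."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def first_delivery(self, idempotency_key: str) -> bool:
        """Return True the first time a key is seen, False on any re-delivery."""
        if idempotency_key in self._seen:
            return False
        self._seen.add(idempotency_key)
        return True
```

A receiver would call `first_delivery` with the header value before acting on an event and skip the event when it returns False.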