Herald — Text AI Runtime

Herald is a stateless, channel-agnostic AI runtime for text-based customer interactions. It processes conversations in discrete turns: accept one HTTP request containing a user message, run an AI agent turn (with optional MCP tools and knowledge base), and stream typed reply events back to the caller's webhook. The caller — typically the WhatsApp Service or a chat adapter — translates those events into whatever transport-specific actions it needs.

Herald is the text-channel equivalent of Atlas. The same Compass agent configuration powers both; Herald handles WhatsApp, SMS, and web chat while Atlas handles voice calls.

Source

PycharmProjects/AI/herald — Python 3.11 / FastAPI


Tech Stack

| Layer | Technology |
|---|---|
| Framework | FastAPI 0.115 + Uvicorn 0.32 |
| Python | 3.11 |
| Data validation | Pydantic v2 + pydantic-settings |
| HTTP client | httpx (async) |
| LLM providers | OpenAI SDK 1.76 · google-genai 1.0 · Azure OpenAI |
| Agent tools | MCP 1.2 (Model Context Protocol, streamable-HTTP) |
| Geocoding | Google Maps API · OSM Nominatim (fallback) |
| Observability | loguru · Langfuse 2.x (optional tracing) |

Architecture

graph TD
    Caller["Caller\n(WhatsApp Service / Chat Adapter)"]

    subgraph Herald
        API["POST /turn\n(FastAPI)"]
        DISP["TurnDispatcher\n(semaphore + per-conv locks)"]
        EXEC["TurnExecutor\n(orchestrator)"]
        LLM["LlmClient\n(OpenAI / Azure / Gemini)"]
        PB["PromptBuilder"]
        BTOOL["Builtin Tools\n(KB, transfer, language…)"]
        MCP["McpClient\n(agent MCP server)"]
        EE["EventEmitter"]
        CB["CallbackClient\n(retries + idempotency)"]
    end

    subgraph External
        COMPASS["Compass\n(agent config + KB)"]
        MCPSRV["Agent MCP Server\n(per-agent)"]
        LLMPROV["LLM Provider\n(OpenAI / Azure / Gemini)"]
        MAPS["Geocoding\n(Maps API / OSM)"]
        WEBHOOK["Caller Webhook"]
    end

    Caller -->|POST /turn| API
    API --> DISP
    DISP --> EXEC
    EXEC -->|get_agent / search_KB| COMPASS
    EXEC --> LLM --> LLMPROV
    EXEC --> PB
    EXEC --> BTOOL --> MAPS
    EXEC --> MCP --> MCPSRV
    EXEC --> EE --> CB --> WEBHOOK

On startup the app creates shared CompassClient, CallbackClient, MapsClient, TurnExecutor, and TurnDispatcher instances, attaching them to app.state. On shutdown the dispatcher drains inflight turns within the configured grace period.


Turn Lifecycle

sequenceDiagram
    participant Caller
    participant Herald
    participant Compass
    participant LLM
    participant MCP

    Caller->>Herald: POST /turn {agent_id, message, history, callback_url}
    Herald-->>Caller: 202 Accepted {turn_id}

    Herald->>Compass: GET agent config
    Herald->>Caller: typing(on) → callback_url

    alt MCP server configured
        Herald->>MCP: list_tools()
        loop tool-call loop (max_iterations)
            Herald->>LLM: chat(messages, tools)
            LLM-->>Herald: tool calls or final text
            Herald->>MCP: call_tool(name, args)
        end
    else KB-only
        Herald->>LLM: chat(messages, builtin_tools)
        LLM-->>Herald: final text
    end

    Herald->>Caller: typing(off) → callback_url
    Herald->>Caller: message(text, final=true) → callback_url
    Herald->>Caller: done(latency_ms, escalate, resolved…) → callback_url

Per-conversation ordering is enforced by ConversationLockMap — two messages from the same conversation_id never process in parallel. Global LLM concurrency is bounded by a semaphore (MAX_CONCURRENT_TURNS).
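The ordering and concurrency rules above can be sketched with plain asyncio primitives. This is a minimal illustration, not Herald's actual code: the internals of ConversationLockMap, the run_turn helper, and the tiny semaphore size are all assumptions made for the demo.

```python
import asyncio
from collections import defaultdict


class ConversationLockMap:
    """One asyncio.Lock per conversation_id (hypothetical internals)."""

    def __init__(self) -> None:
        self._locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    def lock_for(self, conversation_id: str) -> asyncio.Lock:
        return self._locks[conversation_id]


async def run_turn(locks, semaphore, conversation_id, label, log, work_seconds=0.01):
    # Same-conversation turns queue on the lock; the semaphore bounds how
    # many turns run in parallel across all conversations.
    async with locks.lock_for(conversation_id):
        async with semaphore:
            log.append(f"start {label}")
            await asyncio.sleep(work_seconds)  # stands in for the LLM call
            log.append(f"end {label}")


async def demo() -> list:
    locks = ConversationLockMap()
    semaphore = asyncio.Semaphore(2)  # MAX_CONCURRENT_TURNS, shrunk for the demo
    log = []
    await asyncio.gather(
        run_turn(locks, semaphore, "c-123", "c-123/1", log),
        run_turn(locks, semaphore, "c-123", "c-123/2", log),
        run_turn(locks, semaphore, "c-999", "c-999/1", log),
    )
    return log
```

Running asyncio.run(demo()) shows the second c-123 turn starting only after the first has ended, while the c-999 turn overlaps with it. The sketch takes the conversation lock before the semaphore so that a queued turn does not occupy a concurrency slot while it waits; it also never prunes idle locks, which a long-running service would need to do.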


API

POST /turn — Submit a turn

Returns 202 Accepted immediately; all reply events are POSTed to callback_url.

Required header: x-tenantId

Request body:

{
  "conversation_id": "c-123",
  "agent_id": "<uuid>",
  "message": "What are your opening hours?",
  "history": [
    {"role": "user",      "text": "Hi"},
    {"role": "assistant", "text": "Hello! How can I help?"}
  ],
  "contact": {
    "id": "cust-456",
    "name": "Alice",
    "phone": "+441234567890",
    "language": "en-GB"
  },
  "new_conversation": false,
  "callback_url": "https://wa-service.internal/whatsapp/messages/reply",
  "channel": "whatsapp",
  "language": "en-GB",
  "transferable": true
}

Response:

{"turn_id": "a1b2c3d4...", "status": "accepted"}

Returns 503 when the dispatcher is at capacity (MAX_INFLIGHT_TURNS) or shutting down.
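The accept-or-reject decision can be pictured as a small admission gate in front of the dispatcher. The Admission class below is hypothetical; it only illustrates how an in-flight counter and a shutdown flag could map to the 202 and 503 responses described here.

```python
from dataclasses import dataclass


@dataclass
class Admission:
    """Hypothetical admission gate; not Herald's actual dispatcher."""

    max_inflight: int = 1000  # MAX_INFLIGHT_TURNS
    inflight: int = 0
    shutting_down: bool = False

    def try_admit(self) -> int:
        """Return the HTTP status POST /turn would answer with."""
        if self.shutting_down or self.inflight >= self.max_inflight:
            return 503
        self.inflight += 1
        return 202

    def release(self) -> None:
        # Called when a turn finishes, whether it succeeded or failed.
        self.inflight = max(0, self.inflight - 1)
```

A caller that receives 503 would typically retry the same request later rather than drop the message.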


Callback events

All events are POSTed to callback_url with these headers:

X-Idempotency-Key: <turn_id>:<seq>
X-Turn-Id: <turn_id>
Authorization: Bearer <CALLBACK_SHARED_SECRET>   # if secret is configured
Content-Type: application/json

seq increments monotonically per turn. Receivers can deduplicate on turn_id:seq.

Delivery uses exponential back-off (1 s → 2 s → 4 s), retrying on 5xx and transport errors. 4xx responses fail immediately.
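The header and retry rules above could look roughly like the sketch below. It assumes that CALLBACK_MAX_RETRIES counts retries after the first attempt (which yields the 1 s, 2 s, 4 s delays), and it takes the transport as an injected send function so the example stays independent of any HTTP library; both choices are assumptions, not Herald's real CallbackClient.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical transport: send(url, body, headers) returns the HTTP status
# code and raises OSError on a transport failure.
Send = Callable[[str, dict, dict], Awaitable[int]]


def callback_headers(turn_id: str, seq: int, secret: Optional[str] = None) -> dict:
    headers = {
        "X-Idempotency-Key": f"{turn_id}:{seq}",
        "X-Turn-Id": turn_id,
        "Content-Type": "application/json",
    }
    if secret:  # only sent when CALLBACK_SHARED_SECRET is configured
        headers["Authorization"] = f"Bearer {secret}"
    return headers


async def deliver(send: Send, url: str, event: dict, secret: Optional[str] = None,
                  max_retries: int = 3, base_delay: float = 1.0,
                  sleep=asyncio.sleep) -> bool:
    """Return True once the receiver answers 2xx, False when delivery is given up."""
    headers = callback_headers(event["turn_id"], event["seq"], secret)
    for attempt in range(max_retries + 1):
        try:
            status = await send(url, event, headers)
        except OSError:
            status = None  # transport error: retryable
        if status is not None:
            if 200 <= status < 300:
                return True
            if status < 500:
                return False  # 4xx (or other non-5xx): fail immediately
        if attempt < max_retries:
            await sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s
    return False
```

Because every attempt reuses the same X-Idempotency-Key, a receiver that processed an event but whose response was lost can recognise the retry as a duplicate.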

typing

{"event": "typing", "turn_id": "…", "conversation_id": "…", "seq": 1, "state": "on"}
{"event": "typing", "state": "off", …}

message

{
  "event": "message",
  "turn_id": "…", "conversation_id": "…", "seq": 2,
  "text": "Our hours are Monday–Friday, 9 am to 5 pm.",
  "final": true,
  "interim": false
}

interim: true marks a pre-tool acknowledgement ("Let me check that for you…") sent while a tool call is in progress. final: true marks the last message of the turn.

done

{
  "event": "done",
  "turn_id": "…", "conversation_id": "…", "seq": 3,
  "latency_ms": 1240,
  "escalate": false,
  "escalate_reason": null,
  "resolved": false,
  "contact_updates": {"name": "Alice", "language": "en-GB"},
  "summary": "User asked about opening hours."
}

escalate_reason values: agent_unavailable · llm_unavailable · timeout · tool_loop_limit · internal_error

resolved: true when the end_conversation tool was called.

contact_updates carries any contact fields the LLM captured during the turn (name, phone, email, language) so the caller can persist them.

error

{"event": "error", "reason": "Agent configuration not found", …}

Fatal failure; no done event follows.

location_request

{"event": "location_request", "text": "Please share your location so I can find your nearest branch.", …}
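On the caller's side, the five event types can be handled with a simple dispatch on the event field. The sketch below is an assumption about how a chat adapter might do this; the action names (show_typing, send_text, and so on) are illustrative and not part of Herald.

```python
import json


def handle_event(raw: bytes) -> list:
    """Map one Herald callback body to (action, payload) pairs for the transport."""
    event = json.loads(raw)
    kind = event.get("event")
    if kind == "typing":
        action = "show_typing" if event.get("state") == "on" else "hide_typing"
        return [(action, None)]
    if kind == "message":
        # interim messages are acknowledgements; final marks the last message
        return [("send_text", event["text"])]
    if kind == "location_request":
        return [("send_text", event["text"]), ("prompt_location_share", None)]
    if kind == "done":
        actions = []
        if event.get("contact_updates"):
            actions.append(("persist_contact", event["contact_updates"]))
        if event.get("escalate"):
            actions.append(("hand_off_to_human", event.get("escalate_reason")))
        if event.get("resolved"):
            actions.append(("close_conversation", None))
        return actions
    if kind == "error":
        return [("log_error", event.get("reason"))]  # no done event follows
    return []  # unknown event types are ignored in this sketch
```

Returning data instead of performing side effects keeps the mapping easy to test; the adapter can then execute each action against WhatsApp, SMS, or web chat as appropriate.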

GET /health

{
  "status": "ok",
  "dispatcher": {
    "inflight": 5,
    "active_conversations": 3,
    "max_concurrent": 100,
    "max_inflight": 1000
  }
}

Used by Kubernetes liveness and readiness probes.


GET /agents/{agent_id}/greeting

Returns the agent's greeting text for a given language.

Required header: x-tenantId
Optional query param: language_code (BCP-47, e.g. en-GB, es, fr-FR)

{
  "agent_id": "…",
  "language_code": "en-GB",
  "greeting": "Hi! Welcome to Kings Dental Center. How can I help you today?"
}

The greeting is resolved from the agent's base_greeting field in Compass — either a string (single language) or a {language_code: text} dict.
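A possible resolution routine for the two base_greeting shapes is sketched below. The source only says the field is a string or a per-language dict; the fallback order used here (exact code, then primary language, then any regional variant, then the first entry) is an assumption.

```python
from typing import Optional, Union


def resolve_greeting(base_greeting: Union[str, dict, None],
                     language_code: Optional[str] = None) -> Optional[str]:
    """Pick a greeting for language_code from a string or {language_code: text} dict."""
    if not base_greeting:
        return None
    if isinstance(base_greeting, str):
        return base_greeting  # single-language agent
    if language_code:
        if language_code in base_greeting:
            return base_greeting[language_code]
        primary = language_code.split("-")[0]  # "es-MX" -> "es"
        if primary in base_greeting:
            return base_greeting[primary]
        for code, text in base_greeting.items():
            if code.split("-")[0] == primary:  # "en" matches "en-GB"
                return text
    # Assumed last resort: the first configured greeting.
    return next(iter(base_greeting.values()))
```

For example, with {"en-GB": "Hello", "es": "Hola"} a request for es-MX resolves to "Hola", and a request for an unconfigured language falls back to "Hello".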


Built-in Tools

Herald provides a set of agent-neutral function tools that the LLM can call regardless of which MCP server (if any) the agent uses:

| Tool | Action |
|---|---|
| search_knowledge | Semantic KB search via Compass; top-N chunks injected into context |
| transfer_to_human_agent | Raise TransferRequested; triggers escalate: true in done event |
| switch_language | Acknowledge language switch in reply |
| report_user_language | Signal detected language; recorded in contact_updates |
| report_clarification_failure | Signal repeated failure to understand input |
| end_conversation | Raise ConversationEnded; triggers resolved: true in done event |
| find_nearest_branch | Geocode user address, calculate distances, return nearest location |
| capture_contact_info | Store name/phone/email/language from user; included in done.contact_updates |
| request_location | Raise LocationRequested; emits location_request event |

When an MCP server is configured, its tools are merged with the built-ins. On MCP session failure, Herald falls back to built-ins only.


MCP Tool Integration

Each agent in Compass can have an mcp_server_url. On each turn:

  1. Herald opens a streamable-HTTP MCP session to that URL.
  2. list_tools() returns the agent-specific tool catalogue (e.g. appointment booking, order lookup, policy query).
  3. These are merged with the built-in tools and passed to the LLM.
  4. The LLM issues tool calls; Herald dispatches each to the MCP server via call_tool(name, args).
  5. Results are appended to the message list and the LLM is called again (up to MCP_MAX_ITERATIONS).

If the MCP session cannot be opened, Herald logs a warning and continues the turn with the built-in tools only (the KB-only path).
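The per-turn steps and the fallback can be sketched as a loop over injected collaborators. Everything here is a stand-in: the llm callable, the open_mcp factory, the reply dictionaries, and FakeMcp are hypothetical shapes, not the MCP SDK or Herald's TurnExecutor. Mapping an exhausted loop to the tool_loop_limit escalate reason is also an assumption.

```python
import asyncio


async def run_tool_loop(llm, open_mcp, builtin_tools, messages, max_iterations=5):
    """Run the tool-call loop; fall back to built-ins if the MCP session fails."""
    try:
        mcp = await open_mcp()
        mcp_tools = set(await mcp.list_tools())
    except ConnectionError:
        mcp, mcp_tools = None, set()  # MCP unavailable: built-ins only
    tool_names = sorted(set(builtin_tools) | mcp_tools)

    for _ in range(max_iterations):  # MCP_MAX_ITERATIONS
        reply = await llm(messages, tool_names)
        calls = reply.get("tool_calls") or []
        if not calls:
            return {"text": reply.get("text"), "escalate_reason": None}
        for call in calls:
            name, args = call["name"], call.get("args", {})
            if name in mcp_tools:  # on a name clash this sketch prefers MCP
                result = await mcp.call_tool(name, args)
            elif name in builtin_tools:
                result = builtin_tools[name](**args)
            else:
                result = f"unknown tool: {name}"
            messages.append({"role": "tool", "name": name, "content": str(result)})
    return {"text": None, "escalate_reason": "tool_loop_limit"}


class FakeMcp:
    """Illustrative MCP session exposing one agent-specific tool."""

    async def list_tools(self):
        return ["lookup_order"]

    async def call_tool(self, name, args):
        return {"status": "shipped"}


async def demo():
    replies = iter([
        {"tool_calls": [{"name": "lookup_order", "args": {"order_id": "42"}}]},
        {"text": "Your order has shipped."},
    ])

    async def llm(messages, tool_names):
        return next(replies)

    async def open_mcp():
        return FakeMcp()

    messages = [{"role": "user", "content": "Where is my order?"}]
    tools = {"search_knowledge": lambda query: []}
    return await run_tool_loop(llm, open_mcp, tools, messages)
```

asyncio.run(demo()) performs one tool round-trip and then returns the final text. Swapping open_mcp for a coroutine that raises ConnectionError leaves only search_knowledge in the tool list, which mirrors the fallback described above.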


LLM Providers

The active provider is selected by DEFAULT_LLM_PROVIDER. All providers support text completion and function calling.

| Provider | Default model | Notes |
|---|---|---|
| openai | gpt-4o-mini | Native OpenAI SDK |
| azure | gpt-4 | Custom deployment + API version |
| google | gemini-2.5-flash | Native google-genai SDK; OpenAI-compat fallback |

Per-agent model override: if AgentConfig.model is set in Compass, that model is used for that agent's turns regardless of the global default.
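The precedence between the global default and the per-agent override is simple enough to state as a function. The resolve_model helper is hypothetical; the default model names come from the provider table in this section.

```python
from typing import Optional

# Defaults as listed in the provider table of this document.
PROVIDER_DEFAULTS = {
    "openai": "gpt-4o-mini",
    "azure": "gpt-4",
    "google": "gemini-2.5-flash",
}


def resolve_model(provider: str, agent_model: Optional[str] = None) -> str:
    """Return the model for a turn: the agent override wins over the provider default."""
    if provider not in PROVIDER_DEFAULTS:
        raise ValueError(f"unknown provider: {provider}")
    return agent_model or PROVIDER_DEFAULTS[provider]
```

An agent with no model set runs on the provider default; any non-empty AgentConfig.model replaces it for that agent only.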

Langfuse tracing (optional): when LANGFUSE_TRACING=true, all LLM calls are captured in Langfuse for cost tracking and prompt evaluation.


Compass Integration

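Herald uses two Compass endpoints. The agent lookup runs at the start of every turn; the knowledge search runs only when the search_knowledge tool is called: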

| Call | When | On failure |
|---|---|---|
| GET /tenants/agents/{agent_id} | Start of every turn | CompassError → escalate turn |
| POST /knowledge/search | When search_knowledge tool is called | Return empty results (graceful degradation) |

The AgentConfig from Compass drives:

  - System prompt (agent_prompt)
  - Hallucination guard prompt
  - MCP server URL
  - Supported languages
  - Base greeting
  - Tool-specific prompt fragments (tool_prompts)
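The two failure policies for Compass calls differ, and a short sketch makes the contrast concrete. The fetch and search callables are injected stand-ins for the real HTTP calls. Mapping CompassError to the agent_unavailable escalate reason and applying KB_SCORE_THRESHOLD and KB_LIMIT on Herald's side are assumptions.

```python
import asyncio


class CompassError(Exception):
    """Stand-in for the error raised when a Compass call fails."""


async def load_agent(fetch_agent, agent_id):
    """Return (config, escalate_reason); a Compass failure escalates the turn."""
    try:
        return await fetch_agent(agent_id), None
    except CompassError:
        return None, "agent_unavailable"  # assumed mapping to escalate_reason


async def search_knowledge(search, query, limit=6, score_threshold=0.5):
    """Return up to `limit` chunks at or above the threshold, or [] on failure."""
    try:
        chunks = await search(query)
    except CompassError:
        return []  # graceful degradation: the turn continues without KB context
    kept = [c for c in chunks if c["score"] >= score_threshold]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:limit]


async def demo():
    async def broken_fetch(agent_id):
        raise CompassError("compass unreachable")

    async def broken_search(query):
        raise CompassError("search backend down")

    agent = await load_agent(broken_fetch, "agent-1")
    chunks = await search_knowledge(broken_search, "opening hours")
    return agent, chunks
```

With both calls failing, asyncio.run(demo()) yields an escalation for the agent lookup and an empty list for the search, so only the first failure ends the turn.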


Herald vs Atlas

Both services read the same Compass agent record, so a single agent can serve voice and text channels at the same time.

| | Atlas | Herald |
|---|---|---|
| Channel | Voice (phone calls) | Text (WhatsApp, SMS, web chat) |
| Processing | Real-time streaming, call state machine | Async request-reply, stateless turns |
| Server state | Call-stateful (active calls in memory) | Stateless (each turn is independent) |
| Format guidance | Voice prosody, SSML hints | WhatsApp markdown rules, [SPLIT] message splitting |
| Location tool | Not applicable | request_location → location_request event |
| Greeting delivery | Immediate TTS on call connect | Text event; caller decides timing |

Configuration Reference

| Variable | Default | Purpose |
|---|---|---|
| SERVICE_PORT | 8090 | Uvicorn listen port |
| LOG_LEVEL | INFO | Logging level |
| COMPASS_URL | (required) | Compass base URL |
| COMPASS_TIMEOUT | 10.0 | Compass request timeout (seconds) |
| DEFAULT_LLM_PROVIDER | openai | Active LLM provider: openai · azure · google |
| OPENAI_API_KEY | | OpenAI API key |
| OPENAI_DEFAULT_MODEL | gpt-4o-mini | Default OpenAI model |
| GOOGLE_API_KEY | | Google Gemini API key |
| GOOGLE_DEFAULT_MODEL | gemini-2.5-flash | Default Gemini model |
| GOOGLE_MAPS_API_KEY | | Google Maps geocoding key (falls back to OSM if absent) |
| AZURE_OPENAI_ENDPOINT | | Azure OpenAI endpoint URL |
| AZURE_OPENAI_API_KEY | | Azure OpenAI API key |
| AZURE_OPENAI_API_VERSION | 2025-02-01-preview | Azure API version |
| AZURE_OPENAI_DEPLOYMENT | gpt-4 | Azure deployment name |
| KB_ENABLED | true | Enable search_knowledge tool |
| KB_SCORE_THRESHOLD | 0.5 | Minimum relevance score for KB chunks |
| KB_LIMIT | 6 | Max KB chunks to include in context |
| HISTORY_MAX_TURNS | 20 | Conversation history window |
| LLM_MAX_TOKENS | 1024 | Maximum output tokens |
| LLM_TEMPERATURE | 0.3 | LLM temperature |
| MCP_ENABLED | true | Enable agent MCP tools |
| MCP_MAX_ITERATIONS | 5 | Maximum tool-call loop iterations per turn |
| MAX_CONCURRENT_TURNS | 100 | Parallel LLM executions (semaphore) |
| MAX_INFLIGHT_TURNS | 1000 | Total queued + running tasks before 503 |
| TURN_TIMEOUT_SECONDS | 60.0 | Per-turn deadline |
| SHUTDOWN_DRAIN_SECONDS | 30.0 | Graceful shutdown drain window |
| CALLBACK_TIMEOUT_SECONDS | 5.0 | Per-callback-request timeout |
| CALLBACK_MAX_RETRIES | 3 | Exponential back-off retry count |
| CALLBACK_SHARED_SECRET | | Bearer token added to callback Authorization header |
| LANGFUSE_TRACING | false | Enable Langfuse LLM tracing |
| LANGFUSE_PUBLIC_KEY | | Langfuse public key |
| LANGFUSE_SECRET_KEY | | Langfuse secret key |
| LANGFUSE_HOST | http://localhost:3000 | Langfuse host |
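To show how these variables and defaults fit together, here is a standard-library sketch that loads a subset of them from the environment. Herald itself uses pydantic-settings according to the tech stack; the Settings dataclass and load_settings function below are a simplified stand-in, and the accepted boolean spellings are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Subset of the variables above, with the documented defaults."""

    compass_url: str
    service_port: int = 8090
    default_llm_provider: str = "openai"
    mcp_enabled: bool = True
    mcp_max_iterations: int = 5
    max_concurrent_turns: int = 100
    max_inflight_turns: int = 1000
    turn_timeout_seconds: float = 60.0


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    if not env.get("COMPASS_URL"):
        raise ValueError("COMPASS_URL is required")
    provider = env.get("DEFAULT_LLM_PROVIDER", "openai")
    if provider not in {"openai", "azure", "google"}:
        raise ValueError(f"unsupported DEFAULT_LLM_PROVIDER: {provider}")
    return Settings(
        compass_url=env["COMPASS_URL"],
        service_port=int(env.get("SERVICE_PORT", 8090)),
        default_llm_provider=provider,
        mcp_enabled=_as_bool(env.get("MCP_ENABLED", "true")),
        mcp_max_iterations=int(env.get("MCP_MAX_ITERATIONS", 5)),
        max_concurrent_turns=int(env.get("MAX_CONCURRENT_TURNS", 100)),
        max_inflight_turns=int(env.get("MAX_INFLIGHT_TURNS", 1000)),
        turn_timeout_seconds=float(env.get("TURN_TIMEOUT_SECONDS", 60.0)),
    )
```

Passing a plain dict instead of os.environ, for example load_settings({"COMPASS_URL": "http://compass.internal"}) with a placeholder URL, makes the defaults easy to inspect in a test.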

Key Design Decisions

Stateless turns — Herald holds no conversation state between turns. All context (history, contact) is passed in the request, making the service horizontally scalable without sticky sessions.

Per-conversation ordering — ConversationLockMap uses asyncio locks keyed by conversation_id. If two messages from the same conversation arrive before the first turn completes, the second queues behind the first. This prevents out-of-order replies without requiring a message broker.

Bounded concurrency — A global semaphore caps parallel LLM calls at MAX_CONCURRENT_TURNS; turns beyond that cap wait in the queue. Queued plus running turns are capped at MAX_INFLIGHT_TURNS, and POST /turn returns 503 once that cap is reached, so the caller can apply back-pressure instead of overloading the LLM API.

Graceful degradation — MCP failure falls back to KB-only tools. KB search failure returns empty results. Both keep the turn alive rather than failing hard.

Idempotent callbacks — Every event carries turn_id:seq as an idempotency key. Receivers that store this key can safely tolerate re-delivery without duplicating messages.
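A receiver can implement this deduplication with very little code. The CallbackReceiver class below is a hypothetical in-memory sketch; a real deployment would more likely keep the keys in a shared store with an expiry, which is an assumption rather than something the source prescribes.

```python
class CallbackReceiver:
    """Hypothetical webhook handler that processes each turn_id:seq once."""

    def __init__(self) -> None:
        self._seen = set()
        self.delivered = []

    def receive(self, headers: dict, event: dict) -> int:
        key = headers["X-Idempotency-Key"]  # "<turn_id>:<seq>"
        if key in self._seen:
            return 200  # duplicate: acknowledge again, do not reprocess
        self._seen.add(key)
        self.delivered.append(event)
        return 200
```

Answering 200 for a duplicate matters: a 5xx would trigger another retry, while a 2xx tells Herald the event has been handled.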