
How the Voice Agent Works

Last updated: 2026-03-28


The Body Analogy

Think of the voice agent as a person answering the phone. Each service is a body part:

                 ┌────────────────────────────────┐
                 │           THE AGENT            │
                 │                                │
Phone Call ──►   │ 👂 EARS (STT)                  │
(caller speaks)  │    Deepgram Nova-3             │
                 │    Hears speech → text         │
                 │                                │
                 │ 🧠 BRAIN (LLM)                 │
                 │    Gemini 2.5 Flash /          │
                 │    Claude Haiku 4.5            │
                 │    Thinks → decides response   │
                 │                                │
                 │ 📚 MEMORY (RAG)                │
                 │    Church KB + Theology        │
                 │    Recalls facts on demand     │
                 │                                │
                 │ 🛡️ CONSCIENCE (Moderation)     │
                 │    Crisis / Threat / Abuse     │
                 │    Filters before brain acts   │
                 │                                │
                 │ 👄 VOICE (TTS)                 │
                 │    Cartesia Sonic 3            │
                 │    Text → natural speech       │
                 │                                │
                 │ 🙉 FOCUS (Noise Filter)        │
                 │    Drops "um", "uh", fillers   │
                 │    Passes real words through   │
                 │                                │
                 │ 🤝 SOCIAL SENSE (Turn Detect)  │
                 │    Knows when to listen vs     │
                 │    when to speak               │
                 │                                │
                 │ 🖐️ HANDS (Tools)               │
                 │    Books appointments          │
                 │    Sends texts                 │
                 │    Submits prayer requests     │
                 │    Captures visitor info       │
                 │                                │
Phone Call ◄──   │ ☎️ PHONE LINE (SIP)            │
(agent speaks)   │    LiveKit ↔ Telnyx/Twilio     │
                 │    Carries the call            │
                 └────────────────────────────────┘

What Happens When Someone Calls

  1. Phone rings → Telnyx or Twilio receives the call from the PSTN
  2. SIP routing → Call forwarded to LiveKit Cloud via SIP trunk
  3. LiveKit matches → Finds our trunk by phone number, dispatches agent
  4. Agent starts → Loads church config, RAG context, product knowledge
  5. Greeting plays → LLM generates welcome, TTS speaks it ("Thank you for calling...")

Then for each thing the caller says:

  1. Ears hear → Deepgram transcribes speech to text (STT)
  2. Focus filters → Drops filler words ("um", "uh"), passes real speech
  3. Conscience checks → Moderation scans for crisis/threat/abuse BEFORE brain sees it
  4. Brain thinks → LLM processes the text with church context + RAG knowledge
  5. Hands act → If needed, tools fire (prayer request, callback, text link)
  6. Voice speaks → Cartesia converts LLM response to natural speech (TTS)
  7. Caller hears → Audio sent back through SIP to caller's phone
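
The ordering of the first steps above (focus, then conscience, then brain) can be sketched as one small function. This is a minimal illustration, not the agent's actual code: `handle_turn`, `TurnResult`, and the three stage callables are hypothetical names, and in production these stages run inside LiveKit's AgentSession. Abuse handling and tool calls are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TurnResult:
    reply: Optional[str]     # text handed to TTS, or None if nothing is spoken
    end_call: bool = False   # True when moderation decides to hang up


def handle_turn(
    transcript: str,
    is_noise: Callable[[str], bool],
    moderate: Callable[[str], str],    # returns "ok", "crisis" or "threat"
    think: Callable[[str, str], str],  # (text, extra_context) -> reply
) -> TurnResult:
    """Run one transcribed utterance through focus -> conscience -> brain."""
    # Focus: fillers never reach moderation or the LLM.
    if is_noise(transcript):
        return TurnResult(reply=None)

    # Conscience: runs BEFORE the brain sees the text.
    verdict = moderate(transcript)
    if verdict == "threat":
        return TurnResult(reply=None, end_call=True)

    # Crisis callers stay on the line; help resources go into the LLM context.
    extra = "Offer the 988 Lifeline." if verdict == "crisis" else ""

    # Brain: only now does the LLM process the text.
    return TurnResult(reply=think(transcript, extra))
```

Passing the stages in as callables keeps the ordering testable without any network calls; each stage can be swapped for a stub.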

When the call ends:

  1. Farewell → Agent says goodbye, calls end_call tool
  2. Room closes → LiveKit disconnects after an 8s delay, so the farewell TTS can finish
  3. Transcript saved → Conversation written to DB
  4. Classification → Gemini Flash analyzes: summary, sentiment, topics, urgency
  5. Notifications → Email/SMS sent to church staff (if not in testing mode)

Service Map — Every Component

Telephony (Phone Lines)

| Service | Role | When Used |
|---|---|---|
| Telnyx | SIP provider for NEW customer numbers | All new churches get Telnyx numbers |
| Twilio | SIP provider for LEGACY numbers | Demo lines, sales line, toll-free |
| LiveKit Cloud SIP | SIP gateway: bridges phone calls to WebRTC | Every call |

Call path: Caller → PSTN → Telnyx/Twilio → SIP INVITE → LiveKit SIP Gateway → Agent

Key config:

  • LiveKit SIP URL: 5u9xu5ysoly.sip.livekit.cloud (the subdomain is the project ID, NOT the project name)
  • Telnyx FQDN connection: 2925216093662349036 → points to LiveKit SIP URL
  • Main trunk: ST_Xa3Bp9aixRFP — holds all phone numbers (LOCKED)
  • Dispatch rules route calls on the trunk to the agent named churchwiseai-voice

Speech-to-Text (STT) — The Ears

| Service | Model | Role | Latency |
|---|---|---|---|
| Deepgram | Nova-3 | Primary STT | ~200ms |

Deepgram Nova-3 is the primary (and currently only) STT. It performs well on phone audio: it handles background noise, accents, and the low-bitrate G.722 codec.

Configuration: stt="deepgram/nova-3" in AgentSession

Fallback: No STT fallback currently configured. If Deepgram goes down, calls will connect but the agent won't understand speech. TODO: Add Whisper or Google STT as fallback.

Large Language Model (LLM) — The Brain

| Service | Model | Role | Speed | Cost |
|---|---|---|---|---|
| Google | Gemini 2.5 Flash | Coordinator, Sales, Demo agents | Very fast | Low |
| Anthropic | Claude Haiku 4.5 | Care Agent (pastoral/emotional) | Fast | Medium |

Why two brains?

  • Gemini Flash is fast and cheap — great for factual Q&A (service times, directions, events)
  • Claude Haiku has better empathy — handles grief, prayer, crisis with more nuance

Configuration:

COORDINATOR_MODEL = "google/gemini-2.5-flash" # Fast, factual
CARE_MODEL = "anthropic/claude-haiku-4-5-20251001" # Empathetic, careful

Fallback: There is no automatic fallback yet. The intent is that if one LLM fails, the system falls back to the other. TODO: implement LLM fallback chain.
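
One way the missing fallback chain could look is sketched below. It is a provider-agnostic illustration under stated assumptions: `complete_with_fallback` and the `call_model` callable are hypothetical, and only the two model strings come from the configuration above. LiveKit Agents also appears to ship FallbackAdapter wrappers for this purpose, which may be the more natural fit; check that against the installed version before relying on it.

```python
from typing import Callable, List, Sequence, Tuple

COORDINATOR_MODEL = "google/gemini-2.5-flash"
CARE_MODEL = "anthropic/claude-haiku-4-5-20251001"


def complete_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> reply
) -> Tuple[str, str]:
    """Try each model in order and return (model_used, reply)."""
    errors: List[str] = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # broad on purpose: any provider failure moves on
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all LLMs failed: " + "; ".join(errors))
```

Under this sketch the Coordinator would pass `[COORDINATOR_MODEL, CARE_MODEL]` and the Care Agent the reverse order, matching the intent that each brain falls back to the other.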

Text-to-Speech (TTS) — The Voice

| Service | Model | Role | Latency (TTFB) |
|---|---|---|---|
| Cartesia | Sonic 3 | Primary TTS | ~200ms |

Cartesia Sonic 3 produces the most natural-sounding voice for phone calls. Supports custom voices per church.

Configuration: tts="cartesia/sonic-3:{voice_id}" in AgentSession

Default voices:

  • Male: Carson (86e30c1d-714b-4074-a1f2-1cb6b552fb49)
  • Female: Cindy (1242fb95-7ddd-44ac-8a05-9e8a22a6137d)
  • Default for new churches: Cindy (female)

Per-church custom voice: Set voice_id in church_voice_agents table. Must be a valid Cartesia UUID. ElevenLabs IDs will NOT work (caused dead air for Zewdei — fixed 2026-03-28).
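
A format check at config-load time is one way to guard against a non-UUID voice ID reaching Cartesia. The sketch below is an assumption about how that could be done, not the existing code: `is_uuid` and `tts_descriptor` are hypothetical helpers, and a well-formed UUID still does not prove the voice exists in Cartesia.

```python
import uuid
from typing import Optional

DEFAULT_VOICE_ID = "1242fb95-7ddd-44ac-8a05-9e8a22a6137d"  # Cindy, default for new churches


def is_uuid(value: str) -> bool:
    """True if value is a canonical hyphenated UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def tts_descriptor(voice_id: Optional[str]) -> str:
    """Build the tts= string, using the default voice when the ID is malformed."""
    if not voice_id or not is_uuid(voice_id):
        voice_id = DEFAULT_VOICE_ID
    return f"cartesia/sonic-3:{voice_id}"
```

Falling back silently keeps the call audible; rejecting the value when it is saved to church_voice_agents would surface the mistake earlier. Which is preferable is a design choice this sketch does not settle.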

Fallback: No TTS fallback currently configured. If Cartesia goes down, calls will connect but the agent will be silent. TODO: Add LiveKit TTS or Google TTS as fallback.

Voice Activity Detection (VAD) — Hearing Attention

| Service | Model | Role |
|---|---|---|
| Silero | VAD v5 | Detects when someone is speaking vs silence |

Pre-warmed at agent startup (loaded once per worker process, not per call). Combined with the multilingual turn detector for end-of-utterance detection.

Turn Detection — Social Awareness

| Service | Model | Role |
|---|---|---|
| LiveKit | Multilingual Turn Detector | Knows when caller has finished speaking |

End-of-utterance delay: ~600ms. This is the pause after the caller stops speaking before the agent starts responding. Too short = agent interrupts. Too long = awkward silence.

RAG (Retrieval-Augmented Generation) — Memory

| Component | Source | When Used |
|---|---|---|
| Church KB | church_knowledge_base table | Per-turn (500ms timeout) |
| Theological content | unified_rag_content + sai_theological_lenses | Session start (one-time) |
| Product knowledge | product_knowledge table | Session start (one-time) |
| Repeat caller history | voice_call_logs by phone | Session start (one-time) |

  • Embeddings: OpenAI text-embedding-3-small
  • RPCs: search_church_knowledge, search_unified_rag_content
  • Theological lenses: 17 denominations mapped to lens IDs (Baptist→14, Catholic→7, etc.)
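
The 500ms per-turn budget for the church KB lookup can be enforced with a plain asyncio timeout. The following is a minimal sketch, assuming the search is exposed as an async callable; `kb_context` and its `search` parameter are hypothetical names, not the functions in core/rag.py.

```python
import asyncio
from typing import Awaitable, Callable, List

KB_TIMEOUT_S = 0.5  # per-turn budget for the church KB lookup


async def kb_context(
    query: str,
    search: Callable[[str], Awaitable[List[str]]],  # hypothetical async search function
    timeout_s: float = KB_TIMEOUT_S,
) -> List[str]:
    """Return KB snippets, or an empty list if the search exceeds the budget."""
    try:
        return await asyncio.wait_for(search(query), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Answer without KB context rather than keep the caller waiting.
        return []
```

Returning an empty list on timeout means a slow database degrades answer quality for one turn instead of adding dead air to the call.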

Moderation — Conscience

| Check | What It Catches | Action |
|---|---|---|
| Threat | Violence, weapons, bomb threats | End call immediately |
| Crisis | Suicidal ideation, self-harm, coded language | Inject 988 Lifeline into LLM context |
| Abuse | Profanity, harassment | 1st: warning. 2nd+: end call |

Processing order: Moderation runs BEFORE the LLM sees the text. A crisis caller gets help resources injected into the response. A threat caller gets disconnected immediately.

Crisis detection includes:

  • Direct statements ("kill myself")
  • Coded language: elderly ("tired of living"), religious ("ready to go home to be with the Lord"), farewell ("giving away my things")
  • C-SSRS Q1
  • Burden language

Context-aware: "Ready to go to church" does NOT trigger crisis. "Ready to go" (standalone) DOES.
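
The context-aware rule and the three actions can be illustrated with a few regular expressions. The patterns below are deliberately tiny stand-ins and are assumptions, not the contents of moderation.py; `moderate`, `ModerationState`, and the check order (threat, then crisis, then abuse) are likewise hypothetical.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only; the real pattern lists are far longer.
THREAT = re.compile(r"\b(?:bomb|shoot up)\b", re.I)
CRISIS = [
    re.compile(r"\bkill myself\b", re.I),
    re.compile(r"\btired of living\b", re.I),
    re.compile(r"\bready to go home to be with the lord\b", re.I),
    # "ready to go" only counts when it ends the clause, so
    # "ready to go to church" does not match.
    re.compile(r"\bready to go\b\s*(?:[.!?,]|$)", re.I),
]
ABUSE = re.compile(r"\b(?:idiot|stupid)\b", re.I)


@dataclass
class ModerationState:
    abuse_count: int = 0  # per-call counter for the warn-then-disconnect rule


def moderate(text: str, state: ModerationState) -> str:
    """Return "end_call", "inject_988", "warn" or "pass" for one utterance."""
    if THREAT.search(text):
        return "end_call"
    if any(p.search(text) for p in CRISIS):
        return "inject_988"
    if ABUSE.search(text):
        state.abuse_count += 1
        return "warn" if state.abuse_count == 1 else "end_call"
    return "pass"
```

Keeping the abuse counter in a per-call state object is what lets the first offence produce a warning and the second end the call.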

Noise Filtering — Focus

| Category | Examples | Action |
|---|---|---|
| Pure noise | um, uh, hmm, ah, er | Always dropped |
| Backchannels | uh huh, mm hmm, i see | Always dropped |
| Context-dependent | okay, yeah, good, perfect | Dropped if agent didn't ask a question |
| Floor-takes | wait, stop, no, hold on | Always passed (barge-in) |
| Meaningful | thanks, bye, goodbye | Always passed |
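
The categories above translate fairly directly into set lookups. This is a sketch under the assumption that the filter matches whole utterances after normalization; `normalize` and `should_pass` are hypothetical names, and the real logic in call_handler.py may differ.

```python
import re

PURE_NOISE = {"um", "uh", "hmm", "ah", "er"}
BACKCHANNELS = {"uh huh", "mm hmm", "i see"}
CONTEXT_DEPENDENT = {"okay", "yeah", "good", "perfect"}
FLOOR_TAKES = {"wait", "stop", "no", "hold on"}
MEANINGFUL = {"thanks", "bye", "goodbye"}


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z\s]", " ", text.lower()).split())


def should_pass(text: str, agent_asked_question: bool) -> bool:
    """True if the utterance should continue on to moderation and the LLM."""
    phrase = normalize(text)
    if not phrase:
        return False
    if phrase in FLOOR_TAKES or phrase in MEANINGFUL:
        return True  # barge-ins and farewells always get through
    if phrase in PURE_NOISE or phrase in BACKCHANNELS:
        return False
    if phrase in CONTEXT_DEPENDENT:
        return agent_asked_question  # "okay" is an answer only if something was asked
    return True  # anything else is treated as real speech
```

Because matching is on the whole utterance, a sentence that merely starts with "um" still passes through intact.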

Tools — Hands

| Tool | Agent | What It Does |
|---|---|---|
| capture_visitor_info | Coordinator | Saves visitor contact to DB |
| send_directions_link | Coordinator | Texts Google Maps link |
| register_for_event | Coordinator | Registers for church event |
| send_giving_link | Coordinator | Texts giving/donation URL |
| check_availability | Coordinator | Checks Cal.com calendar |
| book_appointment | Coordinator | Books via Cal.com |
| submit_prayer_request | Care | Saves prayer to DB + notifies team |
| request_callback | Care | Saves callback request + notifies pastor |
| send_sms_link | All | Texts any URL to caller |
| end_call | All | Says farewell, waits 8s, disconnects |
| schedule_demo | Sales | Captures demo request |
| search_churches | Sales | Searches PewSearch directory |
| capture_support | Sales | Logs tech support request |

All tools are conditionally enabled based on church config (Cal.com keys, PCO credentials, giving_enabled flag, etc.). See church_voice_agents table.
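
Conditional enablement can be expressed as a mapping from tool name to the config field it depends on. The sketch below is illustrative only: `giving_enabled` is the one field named in this document, while `calcom_api_key` is a hypothetical stand-in for whatever column holds the Cal.com credentials, and only three tools are shown.

```python
from typing import Any, Dict, List, Mapping

# Tool name -> church config field that must be truthy for the tool to load.
TOOL_REQUIREMENTS: Dict[str, str] = {
    "send_giving_link": "giving_enabled",
    "check_availability": "calcom_api_key",  # hypothetical column name
    "book_appointment": "calcom_api_key",    # hypothetical column name
}


def enabled_tools(config: Mapping[str, Any]) -> List[str]:
    """Return the tools whose required config field is present and truthy."""
    return [name for name, key in TOOL_REQUIREMENTS.items() if config.get(key)]
```

A church with giving turned on but no Cal.com key would get only `send_giving_link` from this subset, so the LLM is never offered a tool it cannot complete.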


Fallback Chain — Full Picture

PLATFORM LEVEL:
Primary: LiveKit Cloud (Agents v1.5, Python)
Fallback: None (voice-agent-livekit// is legacy, no longer maintained)
Trigger: LiveKit Cloud outage > 1 hour

TELEPHONY:
New customers: Telnyx (FQDN connection → LiveKit SIP)
Legacy numbers: Twilio (SIP trunk → LiveKit SIP)
If Telnyx FQDN fails: TeXML webhook bridge (churchwiseai.com/api/telnyx/voice-webhook)

STT (Ears):
Primary: Deepgram Nova-3
Fallback: NONE CONFIGURED ⚠️
TODO: Add Google STT or Whisper as fallback

LLM (Brain):
Coordinator: Gemini 2.5 Flash (intended fallback: Claude Haiku 4.5, not automatic yet)
Care Agent: Claude Haiku 4.5 (intended fallback: Gemini 2.5 Flash, not automatic yet)
TODO: Implement automatic LLM fallback

TTS (Voice):
Primary: Cartesia Sonic 3
Fallback: NONE CONFIGURED ⚠️
TODO: Add Google TTS or LiveKit TTS as fallback

VAD:
Primary: Silero v5 (pre-warmed)
Fallback: Built into LiveKit (basic energy detection)

RECORDING:
Status: NOT IMPLEMENTED
Plan: LiveKit Egress (audio-only MP3) → S3/R2 bucket
Cost: ~$0.004/min ($12/mo at 1000 calls/mo, which implies ~3 min per call)

POST-CALL:
Transcript: conversation_item_added event → saved to voice_call_logs
Classification: Gemini 2.5 Flash → summary, sentiment, topics, urgency
Notifications: Resend (email) + Twilio (SMS), fire-and-forget

Key Files

| File | What It Does |
|---|---|
| voice-agent-livekit/main.py | Entry point: SIP routing, session setup, transcript capture |
| voice-agent-livekit/session.py | Phone registry, Supabase client, call logs, classify_call() |
| voice-agent-livekit/safety.py | SafeAgent base class: pre-LLM moderation via llm_node |
| voice-agent-livekit/moderation.py | Threat/crisis/abuse regex patterns |
| voice-agent-livekit/call_handler.py | Noise filtering, farewell detection |
| voice-agent-livekit/core/rag.py | Embeddings + Supabase RPC search |
| voice-agent-livekit/core/notifications.py | Email/SMS fan-out, testing mode redirect |
| voice-agent-livekit/core/prompt_fragments.py | HEAR protocol, crisis protocol, guardrails |
| voice-agent-livekit/core/tools.py | SMS link sender, directions sender |
| voice-agent-livekit/verticals/church/agents.py | CoordinatorAgent + CareAgent classes |
| voice-agent-livekit/verticals/church/prompts.py | Per-church prompt builder |
| voice-agent-livekit/verticals/church/tools.py | Prayer, callback, visitor, event tools |
| voice-agent-livekit/verticals/sales/agents.py | SalesAgent, DemoRouterAgent, DemoAgent |
| voice-agent-livekit/verticals/sales/prompts.py | Sales prompt builder |
| voice-agent-livekit/verticals/church/config.py | Tier gating, default voice IDs |

Documentation Sources

See knowledge/references/voice-agent-sources.md for the full list of LiveKit and Telnyx documentation, GitHub repos, community channels, and Context7 MCP library IDs.