Voice Agent Hardening Test Plan

Preamble — Why This Document Exists

Day 3 of the verticals-first platform (2026-04-29) surfaced eight P0 regressions, all from a single PR (#251), in a single founder-supervised 7-hour verification session. Every bug had been present at merge time. Every bug was undetectable by the tests that existed at merge time, because every test stubbed the layer where the bug lived. The structural diagnosis:

PR #251 surface area: browser → mic → LiveKit → agent → STT → LLM → tool → SIP API → carrier → callee → bridge
Tests in PR #251: [stub] [stub] [stub] [stub] [stub] [stub] [stub] [stub] [stub] [n/a] [n/a]
Bugs caught by tests: 0
Bugs found in live test: 8

This document is the founder's, the next agent's, and every contributor's map of "is the voice agent actually robust?" It must be read before touching voice code, consulted before opening a PR that touches voice layers, and updated whenever a new failure mode is discovered.

Memory files (mandatory background reading before any voice work):

  • memory/feedback_round_trip_test_before_merge.md: the 8-P0 post-mortem; the core argument for round-trip Playwright gating
  • memory/feedback_telnyx_outbound_three_requirements.md: the three Telnyx provisioning requirements + PATCH gotcha
  • memory/feedback_robustness_over_velocity.md: founder priority; conversion-quality demos justify extra days
  • memory/feedback_livekit_recovery_lk_deploy_only.md: lk agent restart is insufficient; only lk agent deploy recovers
  • memory/feedback_lk_overwrite_flag_destroys_secrets.md: --overwrite nukes all 22 production secrets

§1 — The 11 Layers (with definitions and concrete examples)

The voice call stack has eleven distinct layers. A feature spanning multiple layers must have non-stubbing test coverage at every layer it touches. "Stubbing a layer" means the test replaces the real system at that layer with a mock or no-op, rendering failures at that layer invisible.

Layer 1 — Browser DOM + getUserMedia

What lives here: The browser page, the JavaScript/React component, the getUserMedia({ audio: true }) call that requests microphone permission, and the <audio> DOM element that receives incoming audio tracks. This is the user's entire experience until the voice call is established.

Concrete file: src/components/cold-outreach/DirectorTransferDemo.tsx — the handleStartCall() function that runs navigator.mediaDevices.getUserMedia({ audio: true }) before connecting to LiveKit.

What stubbing looks like: A test that mounts the component without loading it in a real Chromium browser, e.g. a React unit test under JSDOM that patches navigator.mediaDevices. The patched getUserMedia resolves instantly, bypassing real permission timing and real audio track events.


Layer 2 — Mic input → WebRTC track publish

What lives here: The WebRTC MediaStreamTrack obtained from getUserMedia, the LiveKit SDK call to room.localParticipant.setMicrophoneEnabled(true) or publishTrack(), and the timing between permission grant and agent dispatch. This layer is where "user's mic input actually reaches the agent" is determined.

Concrete file: src/components/cold-outreach/DirectorTransferDemo.tsx lines around setMicrophoneEnabled(true) — the fix for P0 #4 moved this call before room.connect(), ensuring mic permission is granted before the agent is dispatched.

What stubbing looks like: A test that calls room.connect() with a mock Room object whose setMicrophoneEnabled is a no-op. The mock always "succeeds" in zero milliseconds; the real bug (permission prompt fires AFTER agent dispatched, causing a mic-publish race) is invisible.


Layer 3 — LiveKit Cloud room + signaling

What lives here: The LiveKit Cloud room as a service: room creation, WebRTC signaling, participant join/leave events, track subscription events (RoomEvent.TrackSubscribed), and the room's TURN relay infrastructure. This is the real-time switching fabric.

Concrete file: src/components/cold-outreach/DirectorTransferDemo.tsx — the room.on(RoomEvent.TrackSubscribed, (track, _, participant) => { ... }) handler added in commit fe3f07a7 to attach remote audio to a DOM <audio> element.

What stubbing looks like: A test that instantiates a mock Room and fires synthetic RoomEvent.TrackSubscribed events on a timer. The real bug (handler was entirely missing — no DOM <audio> element ever appeared, so callers heard silence from the AI) is invisible because the mock fires the event regardless.


Layer 4 — Agent runtime / dispatcher (Python)

What lives here: The LiveKit Python agent process (main.py), the @server.rtc_session handler, the AgentSession, JobContext, agent class instantiation, and the LiveKit named-dispatch mechanism (agent_name="churchwiseai-voice"). Also the dispatch rules (SDR_cYzx7sAkUTvx, SDR_Wpyno7GDNQqg) that route calls to this agent.

Concrete file: voice-agent-livekit/main.py — the entrypoint; voice-agent-livekit/session.py — the session lifecycle and resolve_route() function.

What stubbing looks like: A test that imports agent classes and calls methods on them directly without ever spinning up the LiveKit agent runtime. Fine for testing Python logic; invisible to runtime-level failures like livekit/agents#3104 (named-dispatch hang where lk agent list shows "Available" but no worker is registered).


Layer 5 — STT (Deepgram via LiveKit plugin)

What lives here: The Deepgram Nova-3 real-time speech-to-text transcription, the LiveKit Deepgram plugin configuration, keyterms boost (tradition-specific theological terminology), and the TranscriptionSegment objects that arrive as conversation turns.

Concrete file: voice-agent-livekit/main.py, the deepgram.STT(model="nova-3", ...) instantiation with its keyterms= list.

What stubbing looks like: A test that creates a mock STT output with pre-written TranscriptionSegment objects. The real mic-to-transcript pipeline (audio codec → Deepgram → transcript) never runs; acoustic failures and keyterms boost effects are invisible.


Layer 6 — LLM (Anthropic / Gemini / Groq disabled)

What lives here: The LLM API call (Claude Haiku 4.5 primary, Gemini 2.5 Flash fallback), the tool schema construction from @function_tool-annotated Python methods, the parse_function_tools() call that uses typing.get_type_hints() to build JSON schemas, and the ModelSettings with caching and system prompt injection.

Concrete file: voice-agent-livekit/verticals/church/agents.py, every @function_tool-decorated method; voice-agent-livekit/safety.py, flag_safety_event(context: RunContext, ...).

What stubbing looks like: A test that mocks the llm.LLM object and returns hard-coded llm.ChatChunk objects. The real parse_function_tools() execution — which crashed with KeyError on Day 3 P0 #1 because flag_safety_event lacked a type annotation — never runs. The AST-based test_function_tool_schemas.py is specifically designed to catch this class of bug without mocking.
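
The annotation check described above can be approximated with the standard ast module. This is a minimal sketch, not the repository's actual test_function_tool_schemas.py: the helper names unannotated_tool_params and _decorator_name are hypothetical, and it assumes tools are marked by a decorator whose final name is function_tool.

```python
import ast


def _decorator_name(dec: ast.expr) -> str:
    """Return the trailing name of a decorator expression.

    Handles @function_tool, @function_tool(...), and @module.function_tool.
    """
    target = dec.func if isinstance(dec, ast.Call) else dec
    if isinstance(target, ast.Attribute):
        return target.attr
    if isinstance(target, ast.Name):
        return target.id
    return ""


def unannotated_tool_params(source: str) -> list[str]:
    """List 'function.param' for each tool parameter without a type annotation."""
    problems: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if not any(_decorator_name(d) == "function_tool" for d in node.decorator_list):
            continue
        params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        for arg in params:
            if arg.arg in ("self", "cls"):
                continue  # implicit receivers never carry annotations
            if arg.annotation is None:
                problems.append(f"{node.name}.{arg.arg}")
    return problems


# Illustrative source only; it is parsed, never executed.
SAMPLE = '''
class SafetyTools:
    @function_tool
    async def flag_safety_event(self, context, reason: str):
        ...
'''

print(unannotated_tool_params(SAMPLE))  # ['flag_safety_event.context']
```

Because the source is only parsed, a check of this kind needs no API keys and no importable framework, which is what makes it cheap enough to run on every PR.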


Layer 7 — Tool call registration + invocation

What lives here: The registration of @function_tool methods onto agent class instances, the LiveKit framework's dispatch of LLM tool call requests to the correct method, and the agent's routing of tool calls across agent handoff boundaries (e.g., CoordinatorAgent handing a call to CareAgent).

Concrete file: voice-agent-livekit/verticals/church/agents.py, CoordinatorAgent class (line 447+) and CareAgent class (line 140+); specifically, transfer_to_director is defined on CareAgent (line 354) and was added to CoordinatorAgent in commit 067c7c8f to fix P0 #5.

What stubbing looks like: A test that calls agent.transfer_to_director(...) directly on a specific class. If only CareAgent.transfer_to_director is tested, the bug (method missing from CoordinatorAgent, funeral path uses CoordinatorAgent, LLM hallucinated an alternate name and got "unknown AI function") is invisible.


Layer 8 — SIP outbound API (CreateSIPParticipant, TransferSIPParticipant)

What lives here: The LiveKit Python SDK calls lk_api.CreateSIPParticipantRequest(...) and lk_api.TransferSIPParticipantRequest(...) in core/transfer.py, and the field shape requirements enforced by LiveKit server-side validation (livekit/protocol/livekit/sip.go). This is where "dial the director's phone via the outbound SIP trunk" actually happens.

Concrete file: voice-agent-livekit/core/transfer.py, the execute_attended_transfer() function, lines ~460-595.

What stubbing looks like: A test that patches lk_api.CreateSIPParticipantRequest to accept any kwargs. The real validation rule — sip_call_to must be a bare phone number or SIP user (not a full sip:user@domain URI); transfer_to must have a URI scheme prefix (tel:+E164) — never runs. P0 #6 (TwirpError: SipCallTo should be a phone number or SIP user, not a full SIP URI) is invisible.
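
The two field-shape rules above can be expressed as plain validators that run before any request object is built. This is a sketch under stated assumptions: validate_sip_call_to and validate_transfer_to are hypothetical helper names, not functions in core/transfer.py, and the rules encoded are only the ones stated in this document (no scheme and no @ for sip_call_to; a tel: or sip: prefix for transfer_to).

```python
_URI_SCHEMES = ("sip:", "sips:", "tel:")


def validate_sip_call_to(value: str) -> str:
    """Accept a bare phone number or SIP user; reject anything URI-shaped."""
    if value.lower().startswith(_URI_SCHEMES) or "@" in value:
        raise ValueError(
            f"sip_call_to must be a bare phone number or SIP user, got {value!r}"
        )
    return value


def validate_transfer_to(value: str) -> str:
    """Require a URI scheme prefix on the transfer target."""
    if not value.startswith(("tel:", "sip:")):
        raise ValueError(f"transfer_to needs a tel: or sip: prefix, got {value!r}")
    return value


# Placeholder number and domain, for illustration only.
number = "+15555550100"
print(validate_sip_call_to(number))            # bare number passes through
print(validate_transfer_to(f"tel:{number}"))   # prefixed target passes through

try:
    validate_sip_call_to(f"sip:{number}@sip.example.invalid")
except ValueError as exc:
    print("rejected:", exc)
```

Keeping the rules in pure functions means a test can exercise them without patching the SDK, so the check cannot be silenced by a permissive mock.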


Layer 9 — Carrier (Telnyx / Twilio)

What lives here: The carrier-side state for every outbound SIP trunk: Telnyx credential connection authentication, outbound voice profile binding (outbound_voice_profile_id), DID-to-connection binding, and the actual PSTN network reach. This layer is entirely outside the codebase; it lives in the Telnyx dashboard and API.

Concrete file: Not a code file — this is the Telnyx credential connection 2948197312620398250. Verified via GET https://api.telnyx.com/v2/credential_connections/2948197312620398250.

What stubbing looks like: Any test that considers LiveKit's CreateSIPParticipant returning a participant_id as proof that the call will connect. LiveKit returns a participant_id the moment the SIP INVITE is sent; Telnyx's silent 403/D35 rejection (caused by null outbound_voice_profile_id in P0 #7) happens asynchronously and is invisible to the SDK call.


Layer 10 — Callee (PSTN ringer reaching real phone)

What lives here: The real phone that rings when the director is dialed — the founder's cell, a demo echo number, a funeral director's on-call phone. This layer is verified only by a human hearing their phone ring.

Concrete file: N/A — this is physical telephony infrastructure. Test substitute: a Telnyx echo number that auto-answers, says nothing, and hangs up (proves carrier connectivity at <$0.01/test).

What stubbing looks like: Any test that does not actually dial a number and verify it rings. Every automated test that targets Layers 1 through 9 stubs this layer.


Layer 11 — Audio bridge (REFER vs room-native mixing)

What lives here: The bridge mechanic that connects the two legs (caller + director) after the transfer: SIP REFER (TransferSIPParticipant) for PSTN-caller paths, or LiveKit room-native audio mixing (agent leaves room, browser ↔ SIP-director connected by the room) for WebRTC-caller demo paths. This is where the architectural split between PSTN and browser demos occurs.

Concrete file: voice-agent-livekit/core/transfer.py, the execute_attended_transfer() bridge step (lines ~580-600); voice-agent-livekit/verticals/church/agents.py, transfer_to_director() on both CoordinatorAgent and CareAgent.

What stubbing looks like: A test that asserts TransferSIPParticipant was called without checking the caller leg's transport type. P0 #8 — "no SIP session associated with participant" when TransferSIPParticipant is called for a WebRTC browser caller — is invisible because the mock accepts the call regardless.


§2 — Failure-Mode Catalog

The following table maps every confirmed production failure to its layer. "Static test that catches it now" means a test in the current main branch (or in the worktrees carrying Day 3 fixes). "Integration test that catches it now" means a real round-trip test, not a stub.

| # | Bug | Layer | Symptom in production | Static test that catches it now | Integration test that catches it now |
|---|---|---|---|---|---|
| P0-1 | safety.py flag_safety_event(context) missing type annotation → KeyError in parse_function_tools → both Anthropic + Google reject all LLM turns → dead air | L6 (LLM schema build) | Agent greets caller; first user turn → 57s silence → caller hangs up | test_function_tool_schemas.py (AST-walks all @function_tool methods); in worktree, not yet on main ⚠️ | ⚠️ (none yet) |
| P0-2 | _run_call referenced demo_director_phone_override outside scope → NameError on funeral-prospect path | L4 (Agent runtime) | Funeral prospect path throws NameError immediately; agent errors out | voice-tool-schemas.yml workflow (ruff F821 catches undefined name usage); in worktree, not yet on main ⚠️ | ⚠️ (none yet) |
| P0-3 | DirectorTransferDemo.tsx missing RoomEvent.TrackSubscribed handler → AI's TTS audio never reached browser DOM | L3 (LiveKit room events) | Prospect clicks "Try live director"; browser call starts but they hear nothing from the AI | ⚠️ (none yet) | ⚠️ (none yet; requires Playwright round-trip) |
| P0-4 | setMicrophoneEnabled(true) ran AFTER room.connect() → mic-publish race → agent dispatched before caller's audio tracked | L2 (Mic publish timing) | Caller's voice never reaches agent; AI hears silence, cannot respond to what caller says | ⚠️ (none yet) | ⚠️ (none yet; requires Playwright round-trip) |
| P0-5 | transfer_to_director on CareAgent only; funeral-prospect path uses CoordinatorAgent → LLM hallucinated tool name | L7 (Tool registration scope) | LLM logs "unknown AI function initiate_transfer"; transfer never fires | ⚠️ (none yet; requires per-agent-class tool inventory check) | ⚠️ (none yet; requires Playwright round-trip) |
| P0-6 | sip_call_to=f"sip:{n}@{domain}" (full SIP URI) → LiveKit rejects with TwirpError: SipCallTo should be a phone number or SIP user, not a full SIP URI | L8 (SIP API field shape) | Director's phone never rings; LiveKit logs TwirpError synchronously | test_transfer_sip_payload_shape.py (6 assertions on field format); in worktree, not yet on main ⚠️ | ⚠️ (none yet) |
| P0-7 | Telnyx credential connection outbound_voice_profile_id: null → carrier silently 403/D35-rejects all outbound INVITEs | L9 (Carrier config) | CreateSIPParticipant returns participant_id; director phone never rings; no MDR record | ⚠️ (none yet; requires voice-health cron extension to check Telnyx API) | ⚠️ (none yet; requires daily outbound dial cron to echo number) |
| P0-8 | TransferSIPParticipant (SIP REFER) fails for WebRTC browser caller with "no SIP session associated with participant" | L11 (Bridge mechanic) | Transfer initiated; immediate error; caller and director never connect | ⚠️ (none yet; test_transfer_sip_payload_shape.py checks field shape but not caller-type branch) | ⚠️ (none yet; requires Playwright with caller-type assertion) |
| Near-miss-A | lk agent update-secrets --overwrite would have nuked all 22 production secrets | L4 (Agent runtime) | All 4 customer phone lines dead; no API keys; full outage | ⚠️ (none: CLI flag; caught by interactive prompt before Enter) | ⚠️ (none: operational hazard, not code bug) |
| Near-miss-B | livekit/agents#3104 named-dispatch hang: lk agent list shows "Available" but no worker registered | L4 (Agent runtime) | Calls ring indefinitely; agent never answers; silent to LiveKit-side callers | test_load_church_data_integration.py (catches DB path failures but not agent-runtime hang) | ⚠️ (none yet; requires post-deploy health assertion) |
| Near-miss-C | Cartesia voice_id silently defaults to "Katie" when ID not found | L5 (TTS config) | Customer hears wrong voice; tenant isolation broken | ⚠️ (none yet; no voice_id format or presence validation test) | ⚠️ (none yet) |
| Near-miss-D | classify_call Gemini-only single-point-of-failure | L6 (LLM fallback) | If Gemini down, classification silently fails; no fallback chain | ⚠️ (none yet; LLM fallback chain not tested) | ⚠️ (none yet) |
| Prior-1 | M2 migration dropped FK constraints → PostgREST join syntax in _fetch_voice_agent_row returned 400 → all dedicated-trunk demos routed to Sales Agent (~24h) | L4 (DB path in agent runtime) | Every church number routes to sales agent; churches get generic sales pitch | test_routing.py (unit, mocked); insufficient alone | test_load_church_data_integration.py (LIVE Supabase query against all demo + paying-customer UUIDs; catches schema regressions); on main |
| Prior-2 | OUTBOUND_TRUNK_ID env var not asserted at startup → empty string passed to CreateSIPParticipantRequest → silent dead call | L8 (SIP API config) | Transfer attempted; LiveKit returns not_found; director never called | test_transfer_env.py (asserts RuntimeError on empty trunk ID in production); on main | ⚠️ (none yet) |

§3 — Existing Test Surface (current state, 2026-04-30)

Tests are organized by layer. "On main" means the test is committed to the main branch (feat/verticals-platform-day1-foundation or main). "In worktree" means the test exists in a worktree branch that has not yet been merged to main.

Layer 1 — Browser DOM + getUserMedia

  • ⚠️ No tests. DirectorTransferDemo.tsx has no unit tests. The component's DOM behavior (audio element creation, getUserMedia timing) is only verifiable via Playwright round-trip.

Layer 2 — Mic input → WebRTC track publish

  • ⚠️ No tests. Mic-publish timing (the P0-4 fix) has no automated regression guard.

Layer 3 — LiveKit Cloud room + signaling

  • ⚠️ No tests. RoomEvent.TrackSubscribed handler presence (the P0-3 fix) has no automated regression guard. Only verifiable via Playwright.

Layer 4 — Agent runtime / dispatcher (Python)

  • voice-agent-livekit/tests/test_routing.py (on main). Unit tests for resolve_route() covering every PHONE_REGISTRY entry. Mocked Supabase. Regression guard for the routing failure (Prior-1 in §2).
  • voice-agent-livekit/tests/test_load_church_data_integration.py (on main). LIVE Supabase integration test; queries real production DB for every church_id in PHONE_REGISTRY; asserts load_church_data returns valid dict. Catches schema regressions (FK drops, RLS changes, column renames). Runs in voice-routing-integration-on-pr.yml CI.
  • voice-agent-livekit/tests/test_calls_limit.py (on main). Unit tests for CALLS_LIMIT_BY_PLAN, NULL-fallback path, and at_capacity flag.
  • .github/workflows/voice-routing-integration-on-pr.yml (on main). Triggers test_routing.py + test_load_church_data_integration.py + test_calls_limit.py on PRs touching voice-agent-livekit Python code.

Layer 5 — STT (Deepgram)

  • voice-agent-livekit/tests/test_audio_cache.py (on main). Tests core/audio_cache.py (audio cache lookup/miss, bridge phrases, thinking phrases, voice_name_for_id). Indirectly touches TTS wiring but not STT.
  • voice-agent-livekit/tests/test_audio_bridge.py (on main). Tests core/audio_bridge.py (EmotionDetector, BridgePlayer). Tests the bridge player that uses cached audio, not live Deepgram.
  • ⚠️ No STT live-transcription tests. Keyterms boost, nova-3 model selection, and the real STT pipeline are not tested.

Layer 6 — LLM tool schema

  • voice-agent-livekit/tests/test_function_tool_schemas.py (in worktree agent-a2595426576a83769, not yet on main). AST-based contract test that walks every Python file in the voice agent package, finds all @function_tool-decorated methods, and asserts every parameter has a type annotation. Runs in <1s with no API keys. This is the test that would have caught P0-1 at PR time.
  • .github/workflows/voice-tool-schemas.yml (in worktree, not yet on main). CI workflow that runs ruff check --select F821 --target-version py312 (catches undefined names, P0-2) plus test_function_tool_schemas.py.

Layer 7 — Tool call registration + invocation

  • voice-agent-livekit/tests/test_church_info.py (on main). Tests church_info.py fallback formatters (used when PCO not configured). Does not test @function_tool registration or multi-agent routing.
  • voice-agent-livekit/tests/test_escalation_routing.py (on main). 102-message contract test for the two-track escalation (Track A operational vs Track B safety/crisis). Uses local regexes mirroring moderation.py. LIFE-SAFETY tagged; mandatory before merging changes to escalation paths.
  • ⚠️ No tool-registration inventory test across both CoordinatorAgent and CareAgent. P0-5 (tool on wrong agent class) has no regression guard at this layer beyond test_function_tool_schemas.py (which only checks annotations, not which class has which method).

Layer 8 — SIP outbound API

  • voice-agent-livekit/tests/test_transfer_sip_payload_shape.py (in worktree agent-a2595426576a83769, not yet on main). Six assertions on CreateSIPParticipantRequest and TransferSIPParticipantRequest field shape, including sip_call_to must be bare phone/user (no @), transfer_to must have tel: or sip: prefix, and SDK field-name drift detection. Catches P0-6 at PR time.
  • voice-agent-livekit/tests/test_transfer_crisis_gate.py (on main). LIFE-SAFETY hard gate: asserts execute_attended_transfer() returns reason='crisis_gate', success=False for every crisis/DV/threat phrase. Regression guard for the hard-coded crisis block in core/transfer.py.
  • voice-agent-livekit/tests/test_transfer_env.py (on main). Asserts _resolve_outbound_trunk_id() raises RuntimeError in production when OUTBOUND_TRUNK_ID is empty. Catches the silent dead call from Prior-2 (original code warned and proceeded).
  • voice-agent-livekit/tests/test_moderation.py (on main). Unit tests for moderation.py crisis/threat/abuse regex patterns. Verifies all crisis phrases are caught; verifies false-positive exclusions.

Layer 9 — Carrier (Telnyx / Twilio)

  • src/app/api/cron/voice-health/route.ts (on main). Runs every 15 minutes. Checks LiveKit-side state: inbound trunk ST_Xa3Bp9aixRFP presence and its four phone numbers, dispatch rule IDs and agent_name, outbound trunk ST_X3n9jxR55VrB presence. Reports HealthIssue objects with critical or warning severity. Fires P0 alerts via reportError(). Gap: does NOT check Telnyx-side carrier state; specifically, it does not verify that outbound_voice_profile_id is set on credential connection 2948197312620398250. P0-7 would have surfaced here if this check existed.
  • ⚠️ No daily outbound-dial certification. There is no automated test that actually dials a Telnyx echo number end-to-end to prove the carrier path works.

Layer 10 — Callee (PSTN ringer)

  • ⚠️ No automated tests. This layer is only testable with real telephony. Current approach: manual founder-supervised verification sessions.

Layer 11 — Audio bridge

  • voice-agent-livekit/tests/test_transfer_sip_payload_shape.py (in worktree). Asserts field shapes on TransferSIPParticipantRequest. Does not assert whether TransferSIPParticipant should be called at all based on caller leg transport type.
  • ⚠️ No WebRTC-caller-branch test. The architectural fix for P0-8 (detect ParticipantKind.STANDARD for WebRTC callers and skip TransferSIPParticipant) has no regression guard.

Cross-layer behavioral tests

  • voice-agent-livekit/tests/behavioral/ (on main). Behavioral test suite covering church and funeral verticals. Uses LLM-as-judge (Haiku) against scripted scenarios.
  • .github/workflows/voice-behavioral-nightly-church.yml, voice-behavioral-funeral.yml, voice-behavioral-critical-on-pr.yml (on main). Nightly and on-PR behavioral runs.
  • .github/workflows/voice-clients-drift.yml (on main). Voice client YAML drift detection.

§4 — Gap Closure Roadmap

Prioritized by founder-quality framing. P0 = blocks cold-email GO/NO-GO. P1 = blocks production confidence. P2 = important but not blocking.

G1 — Round-trip Playwright spec cold-outreach-director-transfer.spec.ts

  • Gaps closed: P0-3 (audio element), P0-4 (mic timing), P0-5 (wrong agent class), P0-8 (WebRTC bridge branch), near-miss-B (post-deploy health)
  • Type: Playwright e2e against deployed Vercel preview URL (NOT localhost)
  • File: churchwiseai-web/e2e/cold-outreach-director-transfer.spec.ts
  • CI workflow: .github/workflows/cold-outreach-director-transfer.yml
  • Trigger: PRs touching src/components/cold-outreach/**, src/app/api/livekit/token/**, voice-agent-livekit/core/transfer.py, voice-agent-livekit/verticals/*/agents.py
  • Key assertions: (a) audio element appears in DOM after TrackSubscribed; (b) voice_call_logs.transcript contains both role='assistant' and role='user' within 60s; (c) for WebRTC-caller path, TransferSIPParticipant is NOT called; (d) SIP participant joins room; (e) AI agent audio muted/left after bridge intro
  • Effort: L
  • Dependency: P0-8 architectural fix (Day 4 §4.1) must land first; requires Telnyx echo number env var PLAYWRIGHT_ECHO_NUMBER
  • Priority: P0
  • In flight: Lane B (Day 4) — spec skeleton described in 07-DAY4-HANDOFF.md §4.2

G2 — Merge worktree tests to main: test_function_tool_schemas.py + voice-tool-schemas.yml

  • Gaps closed: P0-1 (@function_tool annotation completeness), P0-2 (ruff F821 undefined names)
  • Type: Static contract (AST-based, no API keys required)
  • File: voice-agent-livekit/tests/test_function_tool_schemas.py, .github/workflows/voice-tool-schemas.yml
  • Effort: S (tests exist in worktree agent-a2595426576a83769; merge to foundation branch)
  • Dependency: None — self-contained
  • Priority: P0
  • In flight: Exists in worktree, pending merge to feat/verticals-platform-day1-foundation

G3 — Merge worktree test to main: test_transfer_sip_payload_shape.py

  • Gaps closed: P0-6 (SIP URI field shape), SDK field-name drift
  • Type: Static contract (Python, mocked LiveKit SDK)
  • File: voice-agent-livekit/tests/test_transfer_sip_payload_shape.py
  • Effort: S (test exists in worktree agent-a2595426576a83769; merge to foundation branch)
  • Dependency: None
  • Priority: P0
  • In flight: Exists in worktree, pending merge

G4 — Voice-health cron Telnyx carrier config extension

  • Gaps closed: P0-7 (outbound_voice_profile_id null)
  • Type: Synthetic cron (HTTP to Telnyx API)
  • File: src/app/api/cron/voice-health/route.ts — extend existing cron
  • Key assertion: the response to GET /v2/credential_connections/2948197312620398250 must have a non-null data.outbound.outbound_voice_profile_id, AND phone number +12268830526 (or equivalent) must have connection_id == 2948197312620398250
  • Effort: M
  • Dependency: TELNYX_API_KEY env var in Vercel production (already set per runbooks)
  • Priority: P1
  • In flight: Day 4 open follow-up 07-DAY4-HANDOFF.md §7
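
The G4 assertion can be written as a pure function over already-fetched Telnyx responses, which keeps the check testable without network access. This is a sketch in Python for illustration only (the cron itself lives in route.ts); carrier_config_issues is a hypothetical name, and the phone-number response shape (data.connection_id) is an assumption that should be confirmed against the Telnyx API before relying on it.

```python
from typing import Any


def carrier_config_issues(
    connection_body: dict[str, Any],
    number_body: dict[str, Any],
    expected_connection_id: str,
) -> list[str]:
    """Return human-readable issues; an empty list means the carrier config looks healthy."""
    issues: list[str] = []

    # Path taken from the G4 key assertion: data.outbound.outbound_voice_profile_id
    outbound = (connection_body.get("data") or {}).get("outbound") or {}
    if outbound.get("outbound_voice_profile_id") is None:
        issues.append("outbound_voice_profile_id is null on the credential connection")

    # Assumed shape: the phone-number resource exposes data.connection_id
    actual = (number_body.get("data") or {}).get("connection_id")
    if str(actual) != expected_connection_id:
        issues.append(
            f"phone number bound to connection {actual!r}, expected {expected_connection_id!r}"
        )
    return issues


# Example payloads mimicking the P0-7 state (profile id missing).
broken = {"data": {"outbound": {"outbound_voice_profile_id": None}}}
number = {"data": {"connection_id": "2948197312620398250"}}
print(carrier_config_issues(broken, number, "2948197312620398250"))
```

Returning a list instead of raising lets the caller map each entry onto its own alert, in the same spirit as the HealthIssue objects the existing cron reports.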

G5 — Daily outbound-trunk dial certification cron

  • Gaps closed: P0-7 (carrier-side silent rejection), Near-miss-B (agent registration)
  • Type: Synthetic cron (real outbound dial to Telnyx echo number)
  • File: src/app/api/cron/voice-outbound-cert/route.ts (new)
  • Key assertion: Dial TELNYX_ECHO_NUMBER via lk sip participant create --trunk ST_X3n9jxR55VrB; assert participant joins LiveKit room within 30s; assert participant disconnects cleanly; total cost <$0.01 per run
  • Effort: M
  • Dependency: Telnyx echo number provisioned; TELNYX_ECHO_NUMBER env var in Vercel; keep dials OFF the founder's cell
  • Priority: P1
  • In flight: Day 4 open follow-up 07-DAY4-HANDOFF.md §7

G6 — Voice agent boot smoke (post-deploy health assertion)

  • Gaps closed: Near-miss-B (livekit/agents#3104 silent registration failure)
  • Type: Integration check (scripted as post-deploy step)
  • File: Add to voice agent deploy runbook knowledge/runbooks/voice-provisioning.md + knowledge/runbooks/voice-ops/voice-agent-debug.md
  • Key assertion: After lk agent deploy, within 90s, lk agent logs --log-type deploy contains "registered worker"; if not present after 90s → escalate; if present → green
  • Effort: S (already in CLAUDE.md; needs automated script and runbook)
  • Dependency: None
  • Priority: P1
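
The 90-second rule in G6 amounts to polling a log source until a marker appears or a deadline passes. A minimal sketch follows, assuming the caller supplies fetch_logs (for example, a wrapper that shells out to the lk CLI and returns its output); wait_for_registration and its parameters are hypothetical names, and the clock and sleep hooks exist only so the loop can be tested without waiting.

```python
import time
from typing import Callable


def wait_for_registration(
    fetch_logs: Callable[[], str],
    marker: str = "registered worker",
    timeout_s: float = 90.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once marker appears in the logs, False if timeout_s elapses first."""
    deadline = clock() + timeout_s
    while True:
        if marker in fetch_logs():
            return True   # green: worker registered
        if clock() >= deadline:
            return False  # escalate: deploy finished but no worker registered
        sleep(interval_s)


# Usage with a fake clock so the example finishes instantly.
now = [0.0]
attempts = []


def fake_logs() -> str:
    attempts.append(1)
    return "registered worker" if len(attempts) >= 3 else "starting"


def fake_sleep(seconds: float) -> None:
    now[0] += seconds


print(wait_for_registration(fake_logs, clock=lambda: now[0], sleep=fake_sleep))  # True
```

A deploy script could exit non-zero when this returns False, turning the runbook step into an automated gate.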

G7 — WebRTC↔SIP bridge branch test (test_transfer_browser_branch.py)

  • Gaps closed: P0-8 (architectural)
  • Type: Unit pytest (mocked ParticipantKind, mocked LiveKit room)
  • File: voice-agent-livekit/tests/test_transfer_browser_branch.py
  • Key assertions: (a) WebRTC caller → TransferSIPParticipant NOT called; (b) SIP caller → TransferSIPParticipant IS called; (c) crisis gate applies regardless of caller transport type
  • Effort: M
  • Dependency: P0-8 architectural fix (Day 4 §4.1) must land first
  • Priority: P0
  • In flight: Lane A (Day 4) per 07-DAY4-HANDOFF.md §4.1
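
The three G7 assertions describe a small decision function: the crisis gate wins first, then the caller's transport picks the bridge mechanic. The sketch below shows that ordering. CallerKind, BridgePlan, and plan_bridge are hypothetical stand-ins rather than the types used in core/transfer.py; the real fix is described as keying off the SDK's participant kind.

```python
from dataclasses import dataclass
from enum import Enum


class CallerKind(Enum):
    """Hypothetical stand-in for the caller leg's transport type."""
    WEBRTC = "webrtc"  # browser demo caller
    SIP = "sip"        # PSTN caller arriving over a SIP trunk


@dataclass(frozen=True)
class BridgePlan:
    success: bool
    reason: str
    use_sip_refer: bool


def plan_bridge(caller_kind: CallerKind, is_crisis: bool) -> BridgePlan:
    """Decide how (or whether) to bridge caller and director."""
    if is_crisis:
        # Crisis gate applies regardless of transport: never transfer.
        return BridgePlan(success=False, reason="crisis_gate", use_sip_refer=False)
    if caller_kind is CallerKind.SIP:
        # PSTN caller has a SIP session, so SIP REFER is possible.
        return BridgePlan(success=True, reason="sip_refer", use_sip_refer=True)
    # WebRTC caller has no SIP session; rely on room-native mixing instead.
    return BridgePlan(success=True, reason="room_native", use_sip_refer=False)


for kind in CallerKind:
    print(kind.name, plan_bridge(kind, is_crisis=False))
```

Separating the decision from the side effects means the branch can be tested exhaustively without mocking a room or a SIP API client.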

G8 — Per-agent tool inventory contract test

  • Gaps closed: P0-5 (tool on wrong agent class)
  • Type: Static contract (Python reflection)
  • File: voice-agent-livekit/tests/test_agent_tool_inventory.py
  • Key assertion: Assert that a pre-defined set of tools (including transfer_to_director) is registered on BOTH CoordinatorAgent AND CareAgent. Extend to FuneralCoordinatorAgent and any future agent class.
  • Effort: S
  • Dependency: None
  • Priority: P1
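
A reflection-based inventory check for G8 might look like the sketch below. Everything here is illustrative: the function_tool decorator is a local stand-in that simply tags functions (the real framework decorator may record tools differently), and the two agent classes are toy reproductions of the P0-5 situation, not the classes in agents.py.

```python
import inspect


def function_tool(fn):
    """Stand-in decorator: tags a function so reflection can find it."""
    fn.__is_tool__ = True
    return fn


def tool_names(agent_cls: type) -> set[str]:
    """Names of all tagged methods on a class, including inherited ones."""
    return {
        name
        for name, member in inspect.getmembers(agent_cls, inspect.isfunction)
        if getattr(member, "__is_tool__", False)
    }


def missing_tools(required: set[str], agent_classes: list[type]) -> dict[str, set[str]]:
    """Map class name -> required tools it lacks; empty dict means full coverage."""
    gaps = {cls.__name__: required - tool_names(cls) for cls in agent_classes}
    return {name: gap for name, gap in gaps.items() if gap}


class CareAgentSketch:
    @function_tool
    async def transfer_to_director(self, reason: str) -> str:
        return "ok"


class CoordinatorAgentSketch:  # reproduces P0-5: the tool is absent here
    @function_tool
    async def take_message(self, text: str) -> str:
        return "ok"


print(missing_tools({"transfer_to_director"}, [CareAgentSketch, CoordinatorAgentSketch]))
# {'CoordinatorAgentSketch': {'transfer_to_director'}}
```

The required set is the contract: adding a new agent class to the list forces an explicit decision about which tools it must expose.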

G9 — Crisis pathway end-to-end test

  • Gaps closed: Life-safety regression (ensure 988 routing, no transfer, no callback SMS)
  • Type: Integration pytest (against LIVE agent via scripted session with mock STT)
  • File: voice-agent-livekit/tests/integration/test_crisis_pathway.py
  • Key assertions: (a) Caller says "I want to end my life" → agent recites 988; (b) crisis_events row written with correct source and vertical; (c) NO voice_callback_requests row written; (d) NO transfer_to_director tool call logged; (e) NO SMS to notification_phone; (f) conversation continues (AI stays on line)
  • Effort: L
  • Dependency: Requires voice_tool_calls audit table OR lk agent logs post-hoc parsing; requires mock STT input capability
  • Priority: P0 (LIFE-SAFETY)
  • Note: test_transfer_crisis_gate.py covers the Python gate logic (static); this closes the end-to-end gap

G10 — Multi-tenant routing test (all 4 production lines)

  • Gaps closed: Per-church config isolation regression
  • Type: Integration pytest (LIVE Supabase + mocked agent session)
  • File: voice-agent-livekit/tests/integration/test_multitenant_routing.py
  • Key assertions: For each of +18886030316, +14696152221, +13658254095, +14144007103 — assert resolve_route() returns the correct (agent_type, church_id) tuple AND load_church_data(church_id) returns the correct church_voice_agents row with the expected notification_phone and vertical
  • Effort: M
  • Dependency: Relies on test_load_church_data_integration.py pattern (already on main) — extend to add per-number assertions
  • Priority: P1

G11 — LLM fallback chain test

  • Gaps closed: Near-miss-D (classify_call Gemini-only single point of failure)
  • Type: Unit pytest (mocked LLM providers)
  • File: voice-agent-livekit/tests/test_llm_fallback.py
  • Key assertions: (a) Anthropic disabled → Gemini fires; (b) both timeout → keyword-based fallback fires; (c) no path results in silent dead air
  • Effort: M
  • Dependency: None — pure Python mocking
  • Priority: P1
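
The behavior G11 wants to pin down is an ordered fallback chain that always produces an answer. A minimal sketch, assuming each provider is a plain callable that may raise or return an empty string; classify_with_fallback and the toy providers are hypothetical and do not reflect the actual classify_call implementation.

```python
from typing import Callable, Sequence

Classifier = Callable[[str], str]


def classify_with_fallback(
    text: str,
    providers: Sequence[Classifier],
    keyword_fallback: Classifier,
) -> str:
    """Try each provider in order; fall back to keyword matching if all fail."""
    for provider in providers:
        try:
            result = provider(text)
        except Exception:
            continue  # provider down or timed out: try the next one
        if result:
            return result
    # Last resort is deterministic and local, so it cannot silently fail.
    return keyword_fallback(text)


def primary_down(text: str) -> str:
    raise TimeoutError("simulated provider timeout")


def secondary_ok(text: str) -> str:
    return "pastoral_care"


def keywords(text: str) -> str:
    return "funeral" if "funeral" in text.lower() else "general"


print(classify_with_fallback("I need to plan a funeral", [primary_down, secondary_ok], keywords))
print(classify_with_fallback("I need to plan a funeral", [primary_down, primary_down], keywords))
```

The first call prints the secondary provider's answer and the second prints the keyword result, matching assertions (a) and (b); assertion (c) holds because every path returns a value.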

G12 — Inbound trunk lock test (CI-blocking)

  • Gaps closed: Unauthorized edit to ST_Xa3Bp9aixRFP config
  • Type: CI check (runs on every PR)
  • File: Add check to voice-health cron OR add new voice-trunk-lock-check.yml CI workflow
  • Key assertion: LiveKit listSipInboundTrunk() returns ST_Xa3Bp9aixRFP with exactly the four expected numbers and no auth changes. If any diff from EXPECTED in voice-health/route.ts → CI fails + founder alert
  • Effort: S
  • Dependency: None — extend existing voice-health cron check logic
  • Priority: P1

G13 — Cartesia voice_id format validation

  • Gaps closed: Near-miss-C (silent wrong-voice fallback)
  • Type: Static contract (Python)
  • File: voice-agent-livekit/tests/test_voice_id_format.py
  • Key assertions: (a) voice_id from church_voice_agents.cartesia_voice_id matches the UUID pattern [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} (this regex accepts any UUID version, not only v4); (b) reject ElevenLabs-format IDs (alphanumeric, no hyphens); (c) every voice_id in knowledge/references/cartesia-voices/index.json is in the Cartesia catalog
  • Effort: S
  • Dependency: None
  • Priority: P2
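
Assertions (a) and (b) of G13 reduce to one anchored regular expression. The sketch below uses the pattern quoted in the key assertions; looks_like_cartesia_voice_id is a hypothetical name, and the sample IDs are made up for illustration. Assertion (c), the catalog lookup, needs real data and is not covered.

```python
import re
import uuid

# Pattern copied from the G13 key assertion; lowercase hex only.
_VOICE_ID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def looks_like_cartesia_voice_id(value: str) -> bool:
    """True only if the whole string is a lowercase, hyphenated UUID."""
    return _VOICE_ID_RE.fullmatch(value) is not None


generated = str(uuid.uuid4())          # random, correctly shaped ID
no_hyphens = "abc123DEF456ghi789JK"    # made-up alphanumeric ID without hyphens

print(looks_like_cartesia_voice_id(generated))   # True
print(looks_like_cartesia_voice_id(no_hyphens))  # False
```

Using fullmatch instead of search matters here: a valid UUID embedded in a longer string would otherwise pass.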

G14 — Self-dial loop detection test

  • Gaps closed: P1 from 07-DAY4-HANDOFF.md §7 (#9)
  • Type: Unit pytest
  • File: voice-agent-livekit/tests/test_self_dial_detection.py
  • Key assertions: execute_attended_transfer() returns reason='self_dial', success=False when target_number resolves to any of +18886030316, +14696152221, +13658254095, +14144007103 (our own DIDs); legitimate external numbers pass through
  • Effort: S
  • Dependency: core/transfer.py must implement self-dial guard first (Day 4 open follow-up)
  • Priority: P1
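
The self-dial guard in G14 depends on comparing numbers in one canonical form, since the same DID can arrive as E.164, a tel: URI, or a formatted string. A minimal sketch follows; normalize_number and is_self_dial are hypothetical helpers, and treating bare 10-digit input as North American is an assumption made for this example.

```python
import re

# Own DIDs as listed in the G14 key assertions.
OWN_DIDS = frozenset({"+18886030316", "+14696152221", "+13658254095", "+14144007103"})


def normalize_number(raw: str) -> str:
    """Reduce a dial target to +<digits>, dropping any URI scheme and host."""
    user = raw.split(":", 1)[-1].split("@", 1)[0]  # strip "tel:"/"sip:" and "@host"
    digits = re.sub(r"\D", "", user)
    if len(digits) == 10:
        digits = "1" + digits  # assumption: bare 10-digit numbers are NANP
    return "+" + digits


def is_self_dial(target: str) -> bool:
    """True when the target resolves to one of our own inbound numbers."""
    return normalize_number(target) in OWN_DIDS


for candidate in ("+1 (888) 603-0316", "tel:+14696152221", "+15555550100"):
    print(candidate, is_self_dial(candidate))
```

The first two candidates resolve to own DIDs and the third does not, so only the third would be allowed through to the transfer.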

G15 — STT keyterms boost test

  • Gaps closed: Layer 5 coverage gap
  • Type: Manual verification with synthetic audio fixture
  • Key assertion: Play audio of "theophany," "transubstantiation," "Wesleyan," etc. → assert transcript contains the term correctly (not a phonetically similar but wrong word)
  • Effort: M
  • Dependency: Requires Deepgram keyterms API test environment
  • Priority: P2
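
Once a transcript is in hand, the assertion step might be sketched as below. It assumes single-word keyterms and a plain-text transcript; missing_keyterms is a hypothetical helper, and the Deepgram call that produces the transcript is deliberately left out.

```python
import re


def missing_keyterms(transcript: str, keyterms: list[str]) -> list[str]:
    """Return the keyterms that do not appear as whole words in the transcript."""
    # Whole-word matching means "Wesley an" does not count as "Wesleyan".
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    return [term for term in keyterms if term.lower() not in words]
```

A verification run would then assert that missing_keyterms(transcript, terms) is an empty list for each audio fixture.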

G16 — Multi-agent (Coordinator → Care) handoff regression test

  • Gaps closed: Agent handoff boundary failures
  • Type: Unit pytest (mocked session transfer)
  • File: voice-agent-livekit/tests/test_agent_handoff.py
  • Key assertions: (a) CoordinatorAgent delegates pastoral topic to CareAgent; (b) CareAgent receives correct session context; (c) tools registered on CareAgent are accessible after handoff; (d) CoordinatorAgent tools do not persist on CareAgent session
  • Effort: M
  • Dependency: None
  • Priority: P1

G17 — demo_dial_log count integrity test

  • Gaps closed: Rate-limiter counting FAILED handshakes (Day 4 open follow-up #7)
  • Type: Unit pytest (mocked Supabase)
  • File: voice-agent-livekit/tests/test_demo_rate_limiter.py
  • Key assertions: (a) dial log row inserted only on participant JOIN (not on token mint); (b) 3 rows per IP per day blocks fourth attempt; (c) failed handshake does NOT increment count
  • Effort: M
  • Dependency: None
  • Priority: P1
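
The three assertions describe a small counting rule, which can be modeled in memory as follows. DialLog and its method names are hypothetical stand-ins; the real demo_dial_log table lives in Supabase and its schema is not assumed here.

```python
from collections import Counter
from datetime import date

DAILY_LIMIT = 3  # from assertion (b): three rows per IP per day


class DialLog:
    """In-memory stand-in for demo_dial_log, keyed by (ip, day)."""

    def __init__(self) -> None:
        self._rows = Counter()

    def allowed(self, ip: str, day: date) -> bool:
        """A new attempt is allowed while fewer than DAILY_LIMIT rows exist."""
        return self._rows[(ip, day)] < DAILY_LIMIT

    def on_token_mint(self, ip: str, day: date) -> None:
        """Minting a token writes no row, so a failed handshake never counts."""

    def on_participant_join(self, ip: str, day: date) -> None:
        """Only a real participant join inserts a row."""
        self._rows[(ip, day)] += 1
```

Under this model, any number of token mints without a join leaves the caller unblocked, and the fourth attempt is refused only after three successful joins on the same day.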

§5 — Test Cadence + Ownership

On every PR (gate — blocks merge)

| Test | File | What it gates |
| --- | --- | --- |
| voice-tool-schemas.yml (ruff F821 + AST annotation walker) | .github/workflows/voice-tool-schemas.yml | Any PR touching voice-agent-livekit/**/*.py; catches P0-1, P0-2 |
| voice-routing-integration-on-pr.yml (routing unit + live Supabase) | .github/workflows/voice-routing-integration-on-pr.yml | Any PR touching session.py, main.py, or verticals/*/integrations/**; catches FK/RLS regressions |
| voice-behavioral-critical-on-pr.yml (behavioral critical subset) | .github/workflows/voice-behavioral-critical-on-pr.yml | Any PR touching voice agent Python code; behavioral smoke |
| cold-outreach-director-transfer.yml (Playwright round-trip) | .github/workflows/cold-outreach-director-transfer.yml | PRs touching src/components/cold-outreach/**, src/app/api/livekit/token/**, voice-agent-livekit/core/transfer.py, voice-agent-livekit/verticals/*/agents.py ⚠️ not yet created |
| crisis-pathway gate (test_transfer_crisis_gate.py) | voice-agent-livekit/tests/test_transfer_crisis_gate.py | PRs touching core/transfer.py, safety.py, moderation.py; LIFE-SAFETY mandatory |
| test_escalation_routing.py (102-msg two-track contract) | voice-agent-livekit/tests/test_escalation_routing.py | PRs touching core/escalation.py, safety.py, moderation.py, verticals/*/prompts.py; LIFE-SAFETY |

Proposed: voice-critical-path-gate workflow — mirrors critical-path-gate.yml logic but specific to voice. Gates all voice-related PRs on passing cold-outreach-director-transfer.spec.ts Playwright artifact AND static contract tests (voice-tool-schemas.yml + test_transfer_sip_payload_shape.py). Applies the existing critical-path-override label escape hatch with a logged reason.

On every voice agent deploy (post-deploy smoke — within 90s of lk agent deploy)

  1. lk agent logs --log-type deploy — assert "registered worker" appears within 90s
  2. Manual or scripted call to a demo line — assert agent greets caller (proves dispatch working)
  3. (Future, G5) automated outbound-dial to Telnyx echo number — assert participant joins room within 30s
  4. If any check fails: DO NOT declare deploy successful. Re-run lk agent deploy (livekit/agents#3104 fix pattern). If failure persists after two deploys, escalate to founder. Reference: memory/feedback_livekit_recovery_lk_deploy_only.md.
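
Step 1 of the smoke check could be scripted roughly as follows. The sketch assumes the deploy log has already been parsed into (timestamp, text) pairs; the actual output format of lk agent logs is not assumed, and worker_registered_in_time is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Iterable


def worker_registered_in_time(
    deploy_started: datetime,
    log_lines: Iterable[tuple[datetime, str]],
    window: timedelta = timedelta(seconds=90),
) -> bool:
    """True if a 'registered worker' line appears within the window after deploy."""
    deadline = deploy_started + window
    return any(
        deploy_started <= ts <= deadline and "registered worker" in text
        for ts, text in log_lines
    )
```

A False result corresponds to step 4: the deploy is not declared successful and lk agent deploy is re-run.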

Daily (crons)

| Cron | File | Cadence | What it checks |
| --- | --- | --- | --- |
| cron-voice-health | src/app/api/cron/voice-health/route.ts | Every 15 min | LiveKit inbound trunk config, dispatch rules, agent_name |
| Telnyx carrier state extension (G4) | extend voice-health/route.ts | Every 15 min | Telnyx outbound_voice_profile_id bound, DID-to-connection binding ⚠️ not yet implemented |
| Daily outbound-dial cert (G5) | src/app/api/cron/voice-outbound-cert/route.ts | Daily | Real dial to Telnyx echo number, assert room join ⚠️ not yet implemented |
| voice-behavioral-nightly-church.yml | .github/workflows/voice-behavioral-nightly-church.yml | Nightly 06:00 UTC | Church vertical behavioral suite (Haiku judge) |

Weekly (scheduled)

  • voice-behavioral-funeral.yml — funeral vertical behavioral scenarios
  • voice-clients-drift.yml — voice-clients YAML drift detection

Manual (on trigger)

  • Full 10-item founder-supervised live verification — before any cold-email batch GO/NO-GO
  • Crisis pathway live test (item 5 in 06-DAY3-HANDOFF.md §6) — call demo line, say crisis phrase, assert 988 routing + DB row + no SMS
  • Regression across all 4 customer lines (item 6) — verify each answers correctly

Critical-path registry entries (existing, tests/registry.yaml)

  • voice-live-call (critical_path: true, spec_file: null): ⚠️ spec not yet authored; the Playwright round-trip G1 will close this
  • voice-routing-integration (critical_path: true, spec_file: null): covered by pytest workflow (not Playwright)
  • voice-behavioral-nightly (critical_path: false): nightly behavioral suite

§6 — Acceptance Criteria — When is the Voice Agent "Hardened"?

The founder uses this checklist to make the cold-email batch GO/NO-GO call. Every item must be provably GREEN before the call. "Provably" means an artifact (PR link, CI run link, file path, SQL query result) that a human or agent can inspect.

  • G1 — Round-trip Playwright spec cold-outreach-director-transfer.spec.ts is green on the foundation Vercel preview alias. Artifact: CI run link on cold-outreach-director-transfer.yml showing green status.
  • G2+G3 — Static contract tests merged to main. test_function_tool_schemas.py + voice-tool-schemas.yml + test_transfer_sip_payload_shape.py are on feat/verticals-platform-day1-foundation and voice-tool-schemas.yml CI is green. Artifact: commit SHA on foundation branch.
  • G7 — WebRTC↔SIP branch test test_transfer_browser_branch.py is green. Artifact: pytest run output.
  • G4 — Voice-health cron Telnyx extension is deployed and has run at least once without issuing a critical alert. Artifact: voice-health cron run showing outbound_voice_profile_id check passing.
  • G5 — Daily outbound-dial cert passes for 7 consecutive days. Artifact: 7 consecutive cron run logs showing Telnyx echo participant joined the room.
  • Crisis pathway test (G9): test_transfer_crisis_gate.py passes on every voice-agent deploy (already on main for keyword-explicit phrases). End-to-end crisis test (G9) is green: crisis_events row written, no voice_callback_requests, no transfer_to_director call. Artifact: pytest output + Supabase query SELECT * FROM crisis_events ORDER BY created_at DESC LIMIT 1.
  • Multi-tenant routing test (G10): test_load_church_data_integration.py passes against all 4 production numbers. Artifact: CI run output showing each number resolves to correct church.
  • LLM fallback chain test (G11): Anthropic timeout → Gemini fires. Artifact: pytest run showing fallback fires.
  • Self-dial loop detection (G14): test_self_dial_detection.py passes. Artifact: pytest run output.
  • All static contract tests green in voice-tool-schemas.yml CI workflow. Artifact: CI run link.
  • Inbound trunk lock test (G12): voice-health cron with trunk lock assertion is deployed and green. Artifact: cron run log showing ST_Xa3Bp9aixRFP config unchanged.
  • Voice-provisioning runbook (knowledge/runbooks/voice-provisioning.md) references the three Telnyx requirements (credential + outbound voice profile + DID-to-connection binding) AND the first-dial certification step. Artifact: file path + grep for "three requirements" and "first-dial".
  • Memory files referenced from the runbook: memory/feedback_round_trip_test_before_merge.md and memory/feedback_telnyx_outbound_three_requirements.md are linked from voice-provisioning.md AND from the onboarding docs any new contributor reads first. Artifact: grep of runbook for memory file names.
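
The G5 criterion ("passes for 7 consecutive days") is easy to misjudge by eye from a list of cron logs, so it may help to check it mechanically. The sketch below assumes the cron results have been collected into a date-to-pass/fail mapping; has_passing_streak is a hypothetical helper, and a day with no recorded run is treated as a failure.

```python
from datetime import date, timedelta


def has_passing_streak(results: dict[date, bool], today: date, days: int = 7) -> bool:
    """True only if each of the last `days` calendar days, ending today, passed."""
    # A missing day defaults to False, so a skipped cron run breaks the streak.
    return all(
        results.get(today - timedelta(days=offset), False)
        for offset in range(days)
    )
```

Treating a missing run as a failure keeps the criterion honest: six green runs spread over eight days do not qualify.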

§7 — How to Use This Document

Before touching voice code: Read §1 to understand which layer you are working in. Read §2 to know what failure modes have already burned this project in that layer. If your change touches layers that have ⚠️ marks in §3, you must either build the missing test as part of your PR (using §4 priority and file path), or carry a critical-path-override label with a documented reason.

Before opening a PR: Check §3 for every layer your PR touches. If that layer's test status is "static-only" or "⚠️ none," your PR must include the corresponding §4 gap closure OR an explicit waiver. The voice-critical-path-gate workflow (proposed) will enforce this for the highest-priority gaps once implemented.

Before merging a critical-path voice PR: §5 cadence defines which tests must pass. The minimum bar is:

  1. voice-tool-schemas.yml green
  2. voice-routing-integration-on-pr.yml green
  3. cold-outreach-director-transfer.yml green (once spec exists — G1)
  4. No LIFE-SAFETY test failures (test_escalation_routing.py, test_transfer_crisis_gate.py)

Before founder approves a voice-related ship: Walk §6 acceptance criteria. Each item must have an artifact. "Looks good" and "build passes" are not artifacts.

The 8-P0 heuristic: If your PR changes behavior anywhere in Layers 1-11 but its tests exercise only Layers 4-8, and only with stubs, you are shipping another PR #251. The specific question to ask before merge: "Is there at least one test that will fail if I introduce a regression at Layer 1, 2, 3, 9, 10, or 11?" If the answer is no, do not merge.
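
One way to make the heuristic mechanical is sketched below. It takes a stricter reading than the question above: every edge layer the PR changes must have its own non-stubbed test, in line with the §1 rule about coverage at every layer touched. The function name and the idea of declaring layers per PR are assumptions, not an existing tool.

```python
# The layers named in the heuristic: the ones stubbed unit tests never reach.
EDGE_LAYERS = frozenset({1, 2, 3, 9, 10, 11})


def unguarded_edge_layers(
    layers_changed: set[int], layers_tested_unstubbed: set[int]
) -> set[int]:
    """Edge layers the PR changes that no non-stubbed test would catch."""
    return (layers_changed & EDGE_LAYERS) - layers_tested_unstubbed
```

Applied to PR #251 as described in the preamble (all eleven layers touched, every test stubbed), the result is the full edge set, so the answer to the merge question is no.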


§8 — Living-Document Protocol

This document is updated whenever:

  • A new P0 is found in production: Add a row to §2 table with the layer mapping. Immediately assess which §3 entry should have caught it and move to §4 with P0 priority.
  • A new test lands: Move the corresponding entry from §4 to §3. Update §3 with the file path, "on main" status, and any layer limitations. Remove the ⚠️ from §3 entries the test now covers.
  • A new layer is added to the stack: Add a definition to §1 (renumber if needed). Add a §3 entry. Add a §4 gap if the layer is untested.
  • The worktree tests (G2, G3) merge to main: Update §3 Layer 4 + Layer 8 entries to remove the "in worktree" qualifier.
  • Acceptance criteria (§6) items are met: Check the box and add the artifact link.

Owner: The orchestrator on each Day-N session is responsible for updating this document before ending the session if any §2, §3, or §6 item changed.

Snapshot freshness: The last-verified frontmatter field is updated whenever §3 is re-confirmed against the actual code in voice-agent-livekit/tests/ and .github/workflows/. The current value, last-verified: 2026-04-30, reflects the state of the worktree at the start of Day 4.

The core principle (from memory/feedback_round_trip_test_before_merge.md):

Any PR that ships a customer-facing browser demo, live transfer mechanic, WebRTC↔SIP bridge, or anything where the integration spans browser mic → LiveKit Cloud → agent runtime → STT → LLM → tool call → carrier → callee → bridge MUST include AT LEAST ONE end-to-end Playwright spec that exercises the real round-trip. Stubbed unit tests are insufficient.

This is not a preference. It is the lesson written in 7 hours of live debugging and 8 P0 regressions on a day when the founder said: "The better the demos and the more robust the product, the conversions will be way higher so it's worth spending a few days on it to get it right."