Voice Agent Hardening Test Plan

Preamble — Why This Document Exists

Day 3 of the verticals-first platform (2026-04-29) surfaced eight P0 regressions, all from a single PR (#251), in a single founder-supervised 7-hour verification session. Every bug had been present at merge time. Every bug was undetectable by the tests that existed at merge time, because every test stubbed the layer where the bug lived. The structural diagnosis:

PR #251 surface area: browser → mic → LiveKit → agent → STT → LLM → tool → SIP API → carrier → callee → bridge
Tests in PR #251:    [stub] [stub]  [stub]   [stub] [stub] [stub] [stub]  [stub]   [stub]  [n/a]  [n/a]
Bugs caught by tests: 0
Bugs found in live test: 8

This document is the founder's, the next agent's, and every contributor's map of "is the voice agent actually robust?" It must be read before touching voice code, consulted before opening a PR that touches voice layers, and updated whenever a new failure mode is discovered.

Memory files (mandatory background reading before any voice work):

memory/feedback_round_trip_test_before_merge.md — the 8-P0 post-mortem; the core argument for round-trip Playwright gating
memory/feedback_telnyx_outbound_three_requirements.md — the three Telnyx provisioning requirements + PATCH gotcha
memory/feedback_robustness_over_velocity.md — founder priority: conversion-quality demos justify extra days
memory/feedback_livekit_recovery_lk_deploy_only.md — lk agent restart is insufficient; only lk agent deploy recovers
memory/feedback_lk_overwrite_flag_destroys_secrets.md — --overwrite nukes all 22 production secrets

§1 — The 11 Layers (with definitions and concrete examples)

The voice call stack has eleven distinct layers. A feature spanning multiple layers must have non-stubbing test coverage at every layer it touches. "Stubbing a layer" means the test replaces the real system at that layer with a mock or no-op, rendering failures at that layer invisible.

Layer 1 — Browser DOM + getUserMedia

What lives here: The browser page, the JavaScript/React component, the getUserMedia({ audio: true }) call that requests microphone permission, and the <audio> DOM element that receives incoming audio tracks. This is the user's entire experience until the voice call is established.

Concrete file: src/components/cold-outreach/DirectorTransferDemo.tsx — the handleStartCall() function that runs navigator.mediaDevices.getUserMedia({ audio: true }) before connecting to LiveKit.

What stubbing looks like: A test that mounts the component without actually loading it in a Chromium browser — e.g., a React unit test with JSDOM that patches navigator.mediaDevices. JSDOM's getUserMedia returns a resolved promise instantly, bypassing real permission timing and real audio track events.

Layer 2 — Mic input → WebRTC track publish

What lives here: The WebRTC MediaStreamTrack obtained from getUserMedia, the LiveKit SDK call to room.localParticipant.setMicrophoneEnabled(true) or publishTrack(), and the timing between permission grant and agent dispatch. This layer is where "user's mic input actually reaches the agent" is determined.

Concrete file: src/components/cold-outreach/DirectorTransferDemo.tsx lines around setMicrophoneEnabled(true) — the fix for P0 #4 moved this call before room.connect(), ensuring mic permission is granted before the agent is dispatched.

What stubbing looks like: A test that calls room.connect() with a mock Room object whose setMicrophoneEnabled is a no-op. The mock always "succeeds" in zero milliseconds; the real bug (permission prompt fires AFTER agent dispatched, causing a mic-publish race) is invisible.

Layer 3 — LiveKit Cloud room + signaling

What lives here: The LiveKit Cloud room as a service: room creation, WebRTC signaling, participant join/leave events, track subscription events (RoomEvent.TrackSubscribed), and the room's TURN relay infrastructure. This is the real-time switching fabric.

Concrete file: src/components/cold-outreach/DirectorTransferDemo.tsx — the room.on(RoomEvent.TrackSubscribed, (track, _, participant) => { ... }) handler added in commit fe3f07a7 to attach remote audio to a DOM <audio> element.

What stubbing looks like: A test that instantiates a mock Room and fires synthetic RoomEvent.TrackSubscribed events on a timer. The real bug (handler was entirely missing — no DOM <audio> element ever appeared, so callers heard silence from the AI) is invisible because the mock fires the event regardless.

Layer 4 — Agent runtime / dispatcher (Python)

What lives here: The LiveKit Python agent process (main.py), the @server.rtc_session handler, the AgentSession, JobContext, agent class instantiation, and the LiveKit named-dispatch mechanism (agent_name="churchwiseai-voice"). Also the dispatch rules (SDR_cYzx7sAkUTvx, SDR_Wpyno7GDNQqg) that route calls to this agent.

Concrete file: voice-agent-livekit/main.py — the entrypoint; voice-agent-livekit/session.py — the session lifecycle and resolve_route() function.

What stubbing looks like: A test that imports agent classes and calls methods on them directly without ever spinning up the LiveKit agent runtime. Fine for testing Python logic; invisible to runtime-level failures like livekit/agents#3104 (named-dispatch hang where lk agent list shows "Available" but no worker is registered).

Layer 5 — STT (Deepgram via LiveKit plugin)

What lives here: The Deepgram Nova-3 real-time speech-to-text transcription, the LiveKit Deepgram plugin configuration, keyterms boost (tradition-specific theological terminology), and the TranscriptionSegment objects that arrive as conversation turns.

Concrete file: voice-agent-livekit/main.py — deepgram.STT(model="nova-3", ...) instantiation with keyterms= list.

What stubbing looks like: A test that creates a mock STT output with pre-written TranscriptionSegment objects. The real mic-to-transcript pipeline (audio codec → Deepgram → transcript) never runs; acoustic failures and keyterms boost effects are invisible.

Layer 6 — LLM (Anthropic / Gemini / Groq disabled)

What lives here: The LLM API call (Claude Haiku 4.5 primary, Gemini 2.5 Flash fallback), the tool schema construction from @function_tool-annotated Python methods, the parse_function_tools() call that uses typing.get_type_hints() to build JSON schemas, and the ModelSettings with caching and system prompt injection.

Concrete file: voice-agent-livekit/verticals/church/agents.py — every @function_tool-decorated method; voice-agent-livekit/safety.py — flag_safety_event(context: RunContext, ...).

What stubbing looks like: A test that mocks the llm.LLM object and returns hard-coded llm.ChatChunk objects. The real parse_function_tools() execution — which crashed with KeyError on Day 3 P0 #1 because flag_safety_event lacked a type annotation — never runs. The AST-based test_function_tool_schemas.py is specifically designed to catch this class of bug without mocking.
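
The annotation check described above can be sketched with the standard ast module. This is a minimal illustration of the technique, not the contents of test_function_tool_schemas.py: the helper names find_unannotated_tool_params and _decorator_name are hypothetical, and the sketch ignores *args and **kwargs for brevity.

```python
import ast
from typing import Optional


def _decorator_name(dec: ast.expr) -> Optional[str]:
    """Resolve @function_tool, @function_tool(...), and @pkg.function_tool."""
    if isinstance(dec, ast.Call):
        dec = dec.func
    if isinstance(dec, ast.Name):
        return dec.id
    if isinstance(dec, ast.Attribute):
        return dec.attr
    return None


def find_unannotated_tool_params(source: str, decorator: str = "function_tool") -> list[str]:
    """Return 'function.param' for each unannotated parameter of a decorated function.

    The source is parsed, never imported or executed, so no API keys or
    runtime dependencies are needed.
    """
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if not any(_decorator_name(d) == decorator for d in node.decorator_list):
            continue
        params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        for param in params:
            if param.arg in ("self", "cls"):
                continue  # implicit receivers are never annotated
            if param.annotation is None:
                problems.append(f"{node.name}.{param.arg}")
    return problems


# Illustrative input that mirrors the shape of the P0-1 bug.
SAMPLE = '''
@function_tool
async def flag_safety_event(context, reason: str):
    ...

@function_tool()
async def annotated_tool(context: RunContext, note: str):
    ...

def plain_helper(x):
    ...
'''
```

Running find_unannotated_tool_params(SAMPLE) reports only the unannotated context parameter of flag_safety_event; the undecorated helper is ignored.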

Layer 7 — Tool call registration + invocation

What lives here: The registration of @function_tool methods onto agent class instances, the LiveKit framework's dispatch of LLM tool call requests to the correct method, and the agent's routing of tool calls across agent handoff boundaries (e.g., CoordinatorAgent handing a call to CareAgent).

Concrete file: voice-agent-livekit/verticals/church/agents.py — CoordinatorAgent class (line 447+), CareAgent class (line 140+); specifically, transfer_to_director is defined on CareAgent (line 354) and was added to CoordinatorAgent in commit 067c7c8f to fix P0 #5.

What stubbing looks like: A test that calls agent.transfer_to_director(...) directly on a specific class. If only CareAgent.transfer_to_director is tested, the bug (method missing from CoordinatorAgent, funeral path uses CoordinatorAgent, LLM hallucinated an alternate name and got "unknown AI function") is invisible.

Layer 8 — SIP outbound API (`CreateSIPParticipant`, `TransferSIPParticipant`)

What lives here: The LiveKit Python SDK calls lk_api.CreateSIPParticipantRequest(...) and lk_api.TransferSIPParticipantRequest(...) in core/transfer.py, and the field shape requirements enforced by LiveKit server-side validation (livekit/protocol/livekit/sip.go). This is where "dial the director's phone via the outbound SIP trunk" actually happens.

Concrete file: voice-agent-livekit/core/transfer.py — execute_attended_transfer() function, lines ~460-595.

What stubbing looks like: A test that patches lk_api.CreateSIPParticipantRequest to accept any kwargs. The real validation rule — sip_call_to must be a bare phone number or SIP user (not a full sip:user@domain URI); transfer_to must have a URI scheme prefix (tel:+E164) — never runs. P0 #6 (TwirpError: SipCallTo should be a phone number or SIP user, not a full SIP URI) is invisible.
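
The two field-shape rules above can be expressed as pure predicates that a contract test calls before any request object is built. This is a sketch under stated assumptions: the function names and the exact character set allowed for a SIP user are illustrative, and the authoritative validation remains the LiveKit server.

```python
import re

# Assumption: a bare phone number or SIP user is made of these characters only.
_BARE_USER = re.compile(r"[A-Za-z0-9._+\-]+")
_URI_SCHEMES = ("sip:", "sips:", "tel:")


def is_valid_sip_call_to(value: str) -> bool:
    """sip_call_to must be a bare phone number or SIP user, never a full URI."""
    lowered = value.lower()
    if lowered.startswith(_URI_SCHEMES) or "@" in value:
        return False
    return _BARE_USER.fullmatch(value) is not None


def is_valid_transfer_to(value: str) -> bool:
    """transfer_to must carry a URI scheme prefix such as tel: or sip:."""
    lowered = value.lower()
    for scheme in ("tel:", "sip:"):
        if lowered.startswith(scheme) and len(value) > len(scheme):
            return True
    return False
```

With these predicates, the P0-6 payload shape (a value such as "sip:+15555550100@sip.example.com" passed as sip_call_to) fails locally instead of surfacing as a TwirpError at call time.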

Layer 9 — Carrier (Telnyx / Twilio)

What lives here: The carrier-side state for every outbound SIP trunk: Telnyx credential connection authentication, outbound voice profile binding (outbound_voice_profile_id), DID-to-connection binding, and the actual PSTN network reach. This layer is entirely outside the codebase; it lives in the Telnyx dashboard and API.

Concrete file: Not a code file — this is the Telnyx credential connection 2948197312620398250. Verified via GET https://api.telnyx.com/v2/credential_connections/2948197312620398250.

What stubbing looks like: Any test that considers LiveKit's CreateSIPParticipant returning a participant_id as proof that the call will connect. LiveKit returns a participant_id the moment the SIP INVITE is sent; Telnyx's silent 403/D35 rejection (caused by null outbound_voice_profile_id in P0 #7) happens asynchronously and is invisible to the SDK call.

Layer 10 — Callee (PSTN ringer reaching real phone)

What lives here: The real phone that rings when the director is dialed — the founder's cell, a demo echo number, a funeral director's on-call phone. This layer is verified only by a human hearing their phone ring.

Concrete file: N/A — this is physical telephony infrastructure. Test substitute: a Telnyx echo number that auto-answers, says nothing, and hangs up (proves carrier connectivity at <$0.01/test).

What stubbing looks like: Any test that does not actually dial a number and verify it rings. All automated tests below Layer 9 stub this layer.

Layer 11 — Audio bridge (REFER vs room-native mixing)

What lives here: The bridge mechanic that connects the two legs (caller + director) after the transfer: SIP REFER (TransferSIPParticipant) for PSTN-caller paths, or LiveKit room-native audio mixing (agent leaves room, browser ↔ SIP-director connected by the room) for WebRTC-caller demo paths. This is where the architectural split between PSTN and browser demos occurs.

Concrete file: voice-agent-livekit/core/transfer.py — execute_attended_transfer() bridge step (lines ~580-600); voice-agent-livekit/verticals/church/agents.py — transfer_to_director() on both CoordinatorAgent and CareAgent.

What stubbing looks like: A test that asserts TransferSIPParticipant was called without checking the caller leg's transport type. P0 #8 — "no SIP session associated with participant" when TransferSIPParticipant is called for a WebRTC browser caller — is invisible because the mock accepts the call regardless.

§2 — Failure-Mode Catalog

The following table maps every confirmed production failure to its layer. "Static test that catches it now" means a test in the current main branch (or in the worktrees carrying Day 3 fixes). "Integration test that catches it now" means a real round-trip test, not a stub.

#	Bug	Layer	Symptom in production	Static test catches it now	Integration test catches it now
P0-1	`safety.py flag_safety_event(context)` missing type annotation → `KeyError` in `parse_function_tools` → both Anthropic + Google reject all LLM turns → dead air	L6 — LLM schema build	Agent greets caller; first user turn → 57s silence → caller hangs up	`test_function_tool_schemas.py` (AST-walks all `@function_tool` methods) — in worktree, not yet on main ⚠️	⚠️ (none yet)
P0-2	`_run_call` referenced `demo_director_phone_override` outside scope → `NameError` on funeral-prospect path	L4 — Agent runtime	Funeral prospect path throws `NameError` immediately; agent errors out	`voice-tool-schemas.yml` workflow (ruff F821 catches undefined name usage) — in worktree, not yet on main ⚠️	⚠️ (none yet)
P0-3	`DirectorTransferDemo.tsx` missing `RoomEvent.TrackSubscribed` handler → AI's TTS audio never reached browser DOM	L3 — LiveKit room events	Prospect clicks "Try live director"; browser call starts but they hear nothing from the AI	⚠️ (none yet)	⚠️ (none yet) — requires Playwright round-trip
P0-4	`setMicrophoneEnabled(true)` ran AFTER `room.connect()` → mic-publish race → agent dispatched before caller's audio tracked	L2 — Mic publish timing	Caller's voice never reaches agent; AI hears silence, cannot respond to what caller says	⚠️ (none yet)	⚠️ (none yet) — requires Playwright round-trip
P0-5	`transfer_to_director` on `CareAgent` only; funeral-prospect path uses `CoordinatorAgent` → LLM hallucinated tool name	L7 — Tool registration scope	LLM logs "unknown AI function `initiate_transfer`"; transfer never fires	⚠️ (none yet) — requires per-agent-class tool inventory check	⚠️ (none yet) — requires Playwright round-trip
P0-6	`sip_call_to=f"sip:{n}@{domain}"` (full SIP URI) → LiveKit server-side validation rejects with `TwirpError: SipCallTo should be a phone number or SIP user, not a full SIP URI`	L8 — SIP API field shape	Director's phone never rings; LiveKit logs `TwirpError` synchronously	`test_transfer_sip_payload_shape.py` (6 assertions on field format) — in worktree, not yet on main ⚠️	⚠️ (none yet)
P0-7	Telnyx credential connection `outbound_voice_profile_id: null` → carrier silently 403/D35-rejects all outbound INVITEs	L9 — Carrier config	`CreateSIPParticipant` returns participant_id; director phone never rings; no MDR record	⚠️ (none yet — requires voice-health cron extension to check Telnyx API)	⚠️ (none yet — requires daily outbound dial cron to echo number)
P0-8	`TransferSIPParticipant` (SIP REFER) fails for WebRTC browser caller with "no SIP session associated with participant"	L11 — Bridge mechanic	Transfer initiated; immediate error; caller and director never connect	⚠️ (none yet — `test_transfer_sip_payload_shape.py` checks field shape but not caller-type branch)	⚠️ (none yet — requires Playwright with caller-type assertion)
Near-miss-A	`lk agent update-secrets --overwrite` would have nuked all 22 production secrets	L4 — Agent runtime	All 4 customer phone lines dead; no API keys; full outage	⚠️ (none — CLI flag; caught by interactive prompt before Enter)	⚠️ (none — operational hazard, not code bug)
Near-miss-B	livekit/agents#3104 named-dispatch hang — `lk agent list` shows "Available" but no worker registered	L4 — Agent runtime	Calls ring indefinitely; agent never answers; silent to LiveKit-side callers	`test_load_church_data_integration.py` (catches DB path failures but not agent-runtime hang)	⚠️ (none yet — requires post-deploy health assertion)
Near-miss-C	Cartesia voice `voice_id` silent default to "Katie" when ID not found	L5 (TTS config)	Customer hears wrong voice; tenant isolation broken	⚠️ (none yet — no voice_id format or presence validation test)	⚠️ (none yet)
Near-miss-D	`classify_call` Gemini-only single-point-of-failure	L6 — LLM fallback	If Gemini down, classification silently fails; no fallback chain	⚠️ (none yet — LLM fallback chain not tested)	⚠️ (none yet)
Prior-1	M2 migration dropped FK constraints → PostgREST join syntax in `_fetch_voice_agent_row` returned 400 → all dedicated-trunk demos routed to Sales Agent (~24h)	L4 — DB path in agent runtime	Every church number routes to sales agent; churches get generic sales pitch	`test_routing.py` (unit, mocked) — insufficient alone	`test_load_church_data_integration.py` (LIVE Supabase query against all demo + paying-customer UUIDs — catches schema regressions) — on main
Prior-2	`OUTBOUND_TRUNK_ID` env var not asserted at startup → empty string passed to `CreateSIPParticipantRequest` → silent dead call	L8 — SIP API config	Transfer attempted; LiveKit returns `not_found`; director never called	`test_transfer_env.py` (asserts `RuntimeError` on empty trunk ID in production) — on main	⚠️ (none yet)

§3 — Existing Test Surface (current state, 2026-04-30)

Tests are organized by layer. "On main" means the test is committed to the main branch (feat/verticals-platform-day1-foundation or main). "In worktree" means the test exists in a worktree branch that has not yet been merged to main.

Layer 1 — Browser DOM + getUserMedia

⚠️ No tests. DirectorTransferDemo.tsx has no unit tests. The component's DOM behavior (audio element creation, getUserMedia timing) is only verifiable via Playwright round-trip.

Layer 2 — Mic input → WebRTC track publish

⚠️ No tests. Mic-publish timing (the P0-4 fix) has no automated regression guard.

Layer 3 — LiveKit Cloud room + signaling

⚠️ No tests. RoomEvent.TrackSubscribed handler presence (the P0-3 fix) has no automated regression guard. Only verifiable via Playwright.

Layer 4 — Agent runtime / dispatcher (Python)

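voice-agent-livekit/tests/test_routing.py — on main. Unit tests for resolve_route() covering every PHONE_REGISTRY entry. Mocked Supabase. Regression guard for the Prior-1 routing failure (dedicated-trunk demos routed to the Sales Agent); insufficient alone because Supabase is mocked.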
voice-agent-livekit/tests/test_load_church_data_integration.py — on main. LIVE Supabase integration test; queries real production DB for every church_id in PHONE_REGISTRY; asserts load_church_data returns valid dict. Catches schema regressions (FK drops, RLS changes, column renames). Runs in voice-routing-integration-on-pr.yml CI.
voice-agent-livekit/tests/test_calls_limit.py — on main. Unit tests for CALLS_LIMIT_BY_PLAN, NULL-fallback path, and at_capacity flag.
.github/workflows/voice-routing-integration-on-pr.yml — on main. Triggers test_routing.py + test_load_church_data_integration.py + test_calls_limit.py on PRs touching voice-agent-livekit Python code.

Layer 5 — STT (Deepgram)

voice-agent-livekit/tests/test_audio_cache.py — on main. Tests core/audio_cache.py (audio cache lookup/miss, bridge phrases, thinking phrases, voice_name_for_id). Indirectly touches TTS wiring but not STT.
voice-agent-livekit/tests/test_audio_bridge.py — on main. Tests core/audio_bridge.py (EmotionDetector, BridgePlayer). Tests the bridge player that uses cached audio, not live Deepgram.
⚠️ No STT live-transcription tests. Keyterms boost, nova-3 model selection, and the real STT pipeline are not tested.

Layer 6 — LLM tool schema

voice-agent-livekit/tests/test_function_tool_schemas.py — in worktree agent-a2595426576a83769, not yet on main. AST-based contract test that walks every Python file in the voice agent package, finds all @function_tool-decorated methods, and asserts every parameter has a type annotation. Runs in <1s with no API keys. This is the test that would have caught P0-1 at PR time.
.github/workflows/voice-tool-schemas.yml — in worktree, not yet on main. CI workflow that runs ruff check --select F821 --target-version py312 (catches undefined names, P0-2) plus test_function_tool_schemas.py.

Layer 7 — Tool call registration + invocation

voice-agent-livekit/tests/test_church_info.py — on main. Tests church_info.py fallback formatters (used when PCO not configured). Does not test @function_tool registration or multi-agent routing.
voice-agent-livekit/tests/test_escalation_routing.py — on main. 102-message contract test for the two-track escalation (Track A operational vs Track B safety/crisis). Uses local regexes mirroring moderation.py. LIFE-SAFETY tagged; mandatory before merging changes to escalation paths.
⚠️ No tool-registration inventory test across both CoordinatorAgent and CareAgent. P0-5 (tool on wrong agent class) has no regression guard at this layer beyond test_function_tool_schemas.py (which only checks annotations, not which class has which method).

Layer 8 — SIP outbound API

voice-agent-livekit/tests/test_transfer_sip_payload_shape.py — in worktree agent-a2595426576a83769, not yet on main. Six assertions on CreateSIPParticipantRequest and TransferSIPParticipantRequest field shape, including sip_call_to must be bare phone/user (no @), transfer_to must have tel: or sip: prefix, and SDK field-name drift detection. Catches P0-6 at PR time.
voice-agent-livekit/tests/test_transfer_crisis_gate.py — on main. LIFE-SAFETY hard gate: asserts execute_attended_transfer() returns reason='crisis_gate', success=False for every crisis/DV/threat phrase. Regression guard for the hard-coded crisis block in core/transfer.py.
voice-agent-livekit/tests/test_transfer_env.py — on main. Asserts _resolve_outbound_trunk_id() raises RuntimeError in production when OUTBOUND_TRUNK_ID is empty. Catches the silent dead call from Prior-2 (the original code warned and proceeded).
voice-agent-livekit/tests/test_moderation.py — on main. Unit tests for moderation.py crisis/threat/abuse regex patterns. Verifies all crisis phrases are caught; verifies false-positive exclusions.

Layer 9 — Carrier (Telnyx / Twilio)

src/app/api/cron/voice-health/route.ts — on main. Runs every 15 minutes. Checks LiveKit-side state: inbound trunk ST_Xa3Bp9aixRFP presence and its four phone numbers, dispatch rule IDs and agent_name, outbound trunk ST_X3n9jxR55VrB presence. Reports HealthIssue objects with critical or warning severity. Fires P0 alerts via reportError(). Gap: does NOT check Telnyx-side carrier state — specifically, does not verify outbound_voice_profile_id is set on credential connection 2948197312620398250. P0-7 would have surfaced here if this check existed.
⚠️ No daily outbound-dial certification. There is no automated test that actually dials a Telnyx echo number end-to-end to prove the carrier path works.

Layer 10 — Callee (PSTN ringer)

⚠️ No automated tests. This layer is only testable with real telephony. Current approach: manual founder-supervised verification sessions.

Layer 11 — Audio bridge

voice-agent-livekit/tests/test_transfer_sip_payload_shape.py — in worktree. Asserts field shapes on TransferSIPParticipantRequest. Does not assert whether TransferSIPParticipant should be called at all based on caller leg transport type.
⚠️ No WebRTC-caller-branch test. The architectural fix for P0-8 (detect ParticipantKind.STANDARD for WebRTC callers and skip TransferSIPParticipant) has no regression guard.

Cross-layer behavioral tests

voice-agent-livekit/tests/behavioral/ — on main. Behavioral test suite covering church and funeral verticals. Uses LLM-as-judge (Haiku) against scripted scenarios.
.github/workflows/voice-behavioral-nightly-church.yml, voice-behavioral-funeral.yml, voice-behavioral-critical-on-pr.yml — on main. Nightly and on-PR behavioral runs.
.github/workflows/voice-clients-drift.yml — on main. Voice client YAML drift detection.

§4 — Gap Closure Roadmap

Prioritized by founder-quality framing. P0 = blocks cold-email GO/NO-GO. P1 = blocks production confidence. P2 = important but not blocking.

G1 — Round-trip Playwright spec `cold-outreach-director-transfer.spec.ts`

Gaps closed: P0-3 (audio element), P0-4 (mic timing), P0-5 (wrong agent class), P0-8 (WebRTC bridge branch), Near-miss-B (post-deploy health)
Type: Playwright e2e against deployed Vercel preview URL (NOT localhost)
File: churchwiseai-web/e2e/cold-outreach-director-transfer.spec.ts
CI workflow: .github/workflows/cold-outreach-director-transfer.yml
Trigger: PRs touching src/components/cold-outreach/**, src/app/api/livekit/token/**, voice-agent-livekit/core/transfer.py, voice-agent-livekit/verticals/*/agents.py
Key assertions: (a) audio element appears in DOM after TrackSubscribed; (b) voice_call_logs.transcript contains both role='assistant' and role='user' within 60s; (c) for WebRTC-caller path, TransferSIPParticipant is NOT called; (d) SIP participant joins room; (e) AI agent audio muted/left after bridge intro
Effort: L
Dependency: P0-8 architectural fix (Day 4 §4.1) must land first; requires Telnyx echo number env var PLAYWRIGHT_ECHO_NUMBER
Priority: P0
In flight: Lane B (Day 4) — spec skeleton described in 07-DAY4-HANDOFF.md §4.2

G2 — Merge worktree tests to main: `test_function_tool_schemas.py` + `voice-tool-schemas.yml`

Gaps closed: P0-1 (@function_tool annotation completeness), P0-2 (ruff F821 undefined names)
Type: Static contract (AST-based, no API keys required)
File: voice-agent-livekit/tests/test_function_tool_schemas.py, .github/workflows/voice-tool-schemas.yml
Effort: S (tests exist in worktree agent-a2595426576a83769; merge to foundation branch)
Dependency: None — self-contained
Priority: P0
In flight: Exists in worktree, pending merge to feat/verticals-platform-day1-foundation

G3 — Merge worktree test to main: `test_transfer_sip_payload_shape.py`

Gaps closed: P0-6 (SIP URI field shape), SDK field-name drift
Type: Static contract (Python, mocked LiveKit SDK)
File: voice-agent-livekit/tests/test_transfer_sip_payload_shape.py
Effort: S (test exists in worktree agent-a2595426576a83769; merge to foundation branch)
Dependency: None
Priority: P0
In flight: Exists in worktree, pending merge

G4 — Voice-health cron Telnyx carrier config extension

Gaps closed: P0-7 (outbound_voice_profile_id null)
Type: Synthetic cron (HTTP to Telnyx API)
File: src/app/api/cron/voice-health/route.ts — extend existing cron
Key assertion: GET /v2/credential_connections/2948197312620398250 → .data.outbound.outbound_voice_profile_id must not be null AND phone number +12268830526 (or equivalent) must have connection_id == 2948197312620398250
Effort: M
Dependency: TELNYX_API_KEY env var in Vercel production (already set per runbooks)
Priority: P1
In flight: Day 4 open follow-up 07-DAY4-HANDOFF.md §7
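
The key assertion can be sketched as a pure function over already-fetched Telnyx data, which keeps the check unit-testable without network access. The cron itself lives in a TypeScript route; Python is used here only to illustrate the logic. The function name and the way the DID's connection_id is passed in are assumptions; only the .data.outbound.outbound_voice_profile_id path comes from the assertion above.

```python
from typing import Optional


def carrier_config_issues(
    connection_response: dict,
    did_connection_id: Optional[str],
    expected_connection_id: str,
) -> list[str]:
    """Return human-readable issues; an empty list means the carrier config looks healthy.

    connection_response: parsed JSON body of the credential connection lookup.
    did_connection_id: the connection_id currently bound to the DID (assumed
        to be fetched separately by the caller).
    """
    issues = []
    outbound = (connection_response.get("data") or {}).get("outbound") or {}
    if outbound.get("outbound_voice_profile_id") is None:
        issues.append("outbound_voice_profile_id is null on the credential connection")
    if did_connection_id is None or str(did_connection_id) != expected_connection_id:
        issues.append("DID is not bound to the expected credential connection")
    return issues
```

A null profile ID, the P0-7 condition, yields a non-empty list that the cron could report at critical severity.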

G5 — Daily outbound-trunk dial certification cron

Gaps closed: P0-7 (carrier-side silent rejection), Near-miss-B (agent registration)
Type: Synthetic cron (real outbound dial to Telnyx echo number)
File: src/app/api/cron/voice-outbound-cert/route.ts (new)
Key assertion: Dial TELNYX_ECHO_NUMBER via lk sip participant create --trunk ST_X3n9jxR55VrB; assert participant joins LiveKit room within 30s; assert participant disconnects cleanly; total cost <$0.01 per run
Effort: M
Dependency: Telnyx echo number provisioned; TELNYX_ECHO_NUMBER env var in Vercel; keep dials OFF the founder's cell
Priority: P1
In flight: Day 4 open follow-up 07-DAY4-HANDOFF.md §7

G6 — Voice agent boot smoke (post-deploy health assertion)

Gaps closed: Near-miss-B (livekit/agents#3104 silent registration failure)
Type: Integration check (scripted as post-deploy step)
File: Add to voice agent deploy runbook knowledge/runbooks/voice-provisioning.md + knowledge/runbooks/voice-ops/voice-agent-debug.md
Key assertion: After lk agent deploy, within 90s, lk agent logs --log-type deploy contains "registered worker"; if not present after 90s → escalate; if present → green
Effort: S (already in CLAUDE.md; needs automated script and runbook)
Dependency: None
Priority: P1
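
One way to script the 90-second assertion is a small polling loop with an injectable log source. This is a sketch: wait_for_worker_registration and its parameters are hypothetical, and how the deploy logs are fetched (for example by wrapping lk agent logs --log-type deploy in a subprocess call that returns its output) is left to the caller.

```python
import time
from typing import Callable


def wait_for_worker_registration(
    fetch_logs: Callable[[], str],
    timeout_s: float = 90.0,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll fetch_logs() until "registered worker" appears or the timeout passes.

    Returns True for green and False for escalate. clock and sleep are
    injectable so the loop can be tested without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if "registered worker" in fetch_logs():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Injecting the clock means a unit test can exercise both the green path and the full 90-second timeout in milliseconds.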

G7 — WebRTC↔SIP bridge branch test (`test_transfer_browser_branch.py`)

Gaps closed: P0-8 (architectural)
Type: Unit pytest (mocked ParticipantKind, mocked LiveKit room)
File: voice-agent-livekit/tests/test_transfer_browser_branch.py
Key assertions: (a) WebRTC caller → TransferSIPParticipant NOT called; (b) SIP caller → TransferSIPParticipant IS called; (c) crisis gate applies regardless of caller transport type
Effort: M
Dependency: P0-8 architectural fix (Day 4 §4.1) must land first
Priority: P0
In flight: Lane A (Day 4) per 07-DAY4-HANDOFF.md §4.1
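
The three assertions reduce to a small decision function, sketched below. CallerTransport, plan_bridge, and the "room_native" and "sip_refer" reason strings are hypothetical stand-ins; the real fix is expected to branch on the LiveKit participant kind inside execute_attended_transfer(). Only reason='crisis_gate' with success=False is taken from the existing crisis-gate test.

```python
from enum import Enum


class CallerTransport(Enum):
    WEBRTC = "webrtc"  # browser demo caller
    SIP = "sip"        # PSTN caller arriving through the inbound trunk


def plan_bridge(caller: CallerTransport, crisis_detected: bool) -> dict:
    """Decide how (and whether) to bridge the caller to the director."""
    # The crisis gate runs first and applies to every transport type.
    if crisis_detected:
        return {"success": False, "reason": "crisis_gate", "use_sip_refer": False}
    # SIP REFER only works when the caller leg has a SIP session.
    if caller is CallerTransport.SIP:
        return {"success": True, "reason": "sip_refer", "use_sip_refer": True}
    # WebRTC callers are bridged by the room itself; REFER must not be sent.
    return {"success": True, "reason": "room_native", "use_sip_refer": False}
```

Keeping the decision separate from the SDK calls lets the branch be tested without mocking a LiveKit room at all.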

G8 — Per-agent tool inventory contract test

Gaps closed: P0-5 (tool on wrong agent class)
Type: Static contract (Python reflection)
File: voice-agent-livekit/tests/test_agent_tool_inventory.py
Key assertion: Assert that a pre-defined set of tools (including transfer_to_director) are registered on BOTH CoordinatorAgent AND CareAgent. Extend to FuneralCoordinatorAgent and any future agent class.
Effort: S
Dependency: None
Priority: P1
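
A reflection-based inventory check might look like the sketch below. The function_tool decorator here is a local stand-in that sets a marker attribute; how the real LiveKit decorator tags methods is not assumed, so the detection predicate would need adapting. The two classes are toy versions that reproduce the pre-fix P0-5 state, and lookup_service_times is a made-up tool name.

```python
import inspect


def function_tool(fn):
    """Stand-in decorator (assumption): marks a method as an LLM-callable tool."""
    fn._is_tool = True
    return fn


class CareAgent:
    @function_tool
    def transfer_to_director(self):
        ...


class CoordinatorAgent:
    # Pre-fix state: transfer_to_director is absent from this class.
    @function_tool
    def lookup_service_times(self):
        ...


REQUIRED_TOOLS = {"transfer_to_director"}


def missing_tools(agent_cls: type, required: set) -> set:
    """Return required tool names not registered on agent_cls (inherited tools count)."""
    present = {
        name
        for name, member in inspect.getmembers(agent_cls)
        if getattr(member, "_is_tool", False)
    }
    return set(required) - present
```

A contract test would loop over every agent class and fail when missing_tools returns a non-empty set, which is exactly what happens for the toy CoordinatorAgent above.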

G9 — Crisis pathway end-to-end test

Gaps closed: Life-safety regression (ensure 988 routing, no transfer, no callback SMS)
Type: Integration pytest (against LIVE agent via scripted session with mock STT)
File: voice-agent-livekit/tests/integration/test_crisis_pathway.py
Key assertions: (a) Caller says "I want to end my life" → agent recites 988; (b) crisis_events row written with correct source and vertical; (c) NO voice_callback_requests row written; (d) NO transfer_to_director tool call logged; (e) NO SMS to notification_phone; (f) conversation continues (AI stays on line)
Effort: L
Dependency: Requires voice_tool_calls audit table OR lk agent logs post-hoc parsing; requires mock STT input capability
Priority: P0 (LIFE-SAFETY)
Note: test_transfer_crisis_gate.py covers the Python gate logic (static); this closes the end-to-end gap

G10 — Multi-tenant routing test (all 4 production lines)

Gaps closed: Per-church config isolation regression
Type: Integration pytest (LIVE Supabase + mocked agent session)
File: voice-agent-livekit/tests/integration/test_multitenant_routing.py
Key assertions: For each of +18886030316, +14696152221, +13658254095, +14144007103 — assert resolve_route() returns the correct (agent_type, church_id) tuple AND load_church_data(church_id) returns the correct church_voice_agents row with the expected notification_phone and vertical
Effort: M
Dependency: Relies on test_load_church_data_integration.py pattern (already on main) — extend to add per-number assertions
Priority: P1

G11 — LLM fallback chain test

Gaps closed: Near-miss-D (classify_call Gemini-only single point of failure; no tested fallback chain)
Type: Unit pytest (mocked LLM providers)
File: voice-agent-livekit/tests/test_llm_fallback.py
Key assertions: (a) Anthropic disabled → Gemini fires; (b) both timeout → keyword-based fallback fires; (c) no path results in silent dead air
Effort: M
Dependency: None — pure Python mocking
Priority: P1
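
The fallback behavior these assertions describe can be sketched as an ordered provider chain with a deterministic last resort. The function name, the provider tuple shape, and the "keyword" label are illustrative assumptions, not the agent's actual implementation.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
    keyword_fallback: Callable[[str], str],
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, reply).

    A provider that raises (timeout, API error) or returns an empty reply is
    skipped. If every provider fails, the keyword fallback answers, so no
    path ends in silent dead air.
    """
    for name, call in providers:
        try:
            reply = call(prompt)
        except Exception:
            # Broad catch is deliberate in this sketch: any provider failure
            # should move on to the next link in the chain.
            continue
        if reply and reply.strip():
            return name, reply
    return "keyword", keyword_fallback(prompt)
```

In a unit test the providers are plain callables, so "Anthropic disabled" is simply a function that raises.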

G12 — Inbound trunk lock test (CI-blocking)

Gaps closed: Unauthorized edit to ST_Xa3Bp9aixRFP config
Type: CI check (runs on every PR)
File: Add check to voice-health cron OR add new voice-trunk-lock-check.yml CI workflow
Key assertion: LiveKit listSipInboundTrunk() returns ST_Xa3Bp9aixRFP with exactly the four expected numbers and no auth changes. If any diff from EXPECTED in voice-health/route.ts → CI fails + founder alert
Effort: S
Dependency: None — extend existing voice-health cron check logic
Priority: P1
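
The number comparison at the heart of this check is a set difference, sketched below. The cron logic itself is TypeScript; Python is used here only to illustrate the comparison. The function name is hypothetical, the numbers in the usage line are placeholders rather than the production DIDs, and auth-field comparison is omitted.

```python
from typing import Iterable


def trunk_number_drift(expected: Iterable[str], actual: Iterable[str]) -> dict:
    """Compare the numbers on the inbound trunk against the locked expectation.

    Returns sorted lists of missing and unexpected numbers; both empty means
    the trunk still matches the lock.
    """
    expected_set, actual_set = set(expected), set(actual)
    return {
        "missing": sorted(expected_set - actual_set),
        "unexpected": sorted(actual_set - expected_set),
    }


def trunk_is_locked(expected: Iterable[str], actual: Iterable[str]) -> bool:
    drift = trunk_number_drift(expected, actual)
    return not drift["missing"] and not drift["unexpected"]
```

For example, trunk_number_drift(["+15555550101", "+15555550102"], ["+15555550101"]) reports "+15555550102" as missing, which a CI job could turn into a failing exit code.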

G13 — Cartesia voice_id format validation

Gaps closed: Near-miss-C (silent wrong-voice fallback)
Type: Static contract (Python)
File: voice-agent-livekit/tests/test_voice_id_format.py
Key assertions: (a) voice_id from church_voice_agents.cartesia_voice_id matches the canonical lowercase UUID pattern [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} (this pattern accepts any UUID version, not only UUID4); (b) reject ElevenLabs-format IDs (alphanumeric, no hyphens); (c) every voice_id in knowledge/references/cartesia-voices/index.json is in the Cartesia catalog
Effort: S
Dependency: None
Priority: P2
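
Assertions (a) and (b) can be sketched with a single compiled pattern. The function name is hypothetical, and the ElevenLabs-style ID used in the usage note is a made-up example of the "alphanumeric, no hyphens" shape. Catalog membership, assertion (c), needs the index file and is not covered here.

```python
import re

# The pattern from assertion (a); fullmatch anchors it at both ends.
_VOICE_ID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def is_valid_cartesia_voice_id(voice_id: object) -> bool:
    """True only for a lowercase, hyphenated, UUID-shaped string."""
    return isinstance(voice_id, str) and _VOICE_ID_RE.fullmatch(voice_id) is not None
```

A value such as "123e4567-e89b-12d3-a456-426614174000" passes, while a hyphen-free alphanumeric ID like "Ab3dEf6hIj9kLm2nOp5q", an empty string, or None is rejected before it can trigger the silent default voice.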

G14 — Self-dial loop detection test

Gaps closed: P1 from 07-DAY4-HANDOFF.md §7 (#9)
Type: Unit pytest
File: voice-agent-livekit/tests/test_self_dial_detection.py
Key assertions: execute_attended_transfer() returns reason='self_dial', success=False when target_number resolves to any of +18886030316, +14696152221, +13658254095, +14144007103 (our own DIDs); legitimate external numbers pass through
Effort: S
Dependency: core/transfer.py must implement self-dial guard first (Day 4 open follow-up)
Priority: P1
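
A self-dial guard of this kind might be sketched as follows. The DID list is taken from the key assertions above; normalize_number, self_dial_guard, and the assumption that 10-digit inputs are North American numbers are illustrative and not part of core/transfer.py today.

```python
import re
from typing import Optional

OWN_DIDS = frozenset({"+18886030316", "+14696152221", "+13658254095", "+14144007103"})


def normalize_number(raw: str) -> str:
    """Reduce tel:/sip: URIs and formatted numbers to +<digits>."""
    value = raw.strip()
    for scheme in ("tel:", "sip:"):
        if value.lower().startswith(scheme):
            value = value[len(scheme):]
    value = value.split("@", 1)[0]      # drop any SIP domain part
    digits = re.sub(r"\D", "", value)
    if len(digits) == 10:               # assumption: bare 10 digits means NANP
        digits = "1" + digits
    return "+" + digits


def self_dial_guard(target_number: str) -> Optional[dict]:
    """Return a failure result when the target is one of our own DIDs, else None."""
    if normalize_number(target_number) in OWN_DIDS:
        return {"success": False, "reason": "self_dial"}
    return None
```

Normalizing before comparison matters because the same DID can arrive as "+18886030316", "(888) 603-0316", or "tel:+18886030316".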

G15 — STT keyterms boost test

Gaps closed: Layer 5 coverage gap
Type: Manual verification with synthetic audio fixture
Key assertion: Play audio of "theophany," "transubstantiation," "Wesleyan," etc. → assert transcript contains the term correctly (not a phonetically similar but wrong word)
Effort: M
Dependency: Requires Deepgram keyterms API test environment
Priority: P2

G16 — Multi-agent (Coordinator → Care) handoff regression test

Gaps closed: Agent handoff boundary failures
Type: Unit pytest (mocked session transfer)
File: voice-agent-livekit/tests/test_agent_handoff.py
Key assertions: (a) CoordinatorAgent delegates pastoral topic to CareAgent; (b) CareAgent receives correct session context; (c) tools registered on CareAgent are accessible after handoff; (d) CoordinatorAgent tools do not persist on CareAgent session
Effort: M
Dependency: None
Priority: P1

G17 — `demo_dial_log` count integrity test

Gaps closed: Rate-limiter counting FAILED handshakes (Day 4 open follow-up #7)
Type: Unit pytest (mocked Supabase)
File: voice-agent-livekit/tests/test_demo_rate_limiter.py
Key assertions: (a) dial log row inserted only on participant JOIN (not on token mint); (b) 3 rows per IP per day blocks fourth attempt; (c) failed handshake does NOT increment count
Effort: M
Dependency: None
Priority: P1
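
The counting rule in G17 can be sketched with an in-memory stand-in for `demo_dial_log`. This is an assumption-laden illustration rather than the production limiter: the class, the `attempt_call` helper, and the IP address are hypothetical, and the real implementation would read and write Supabase rows.

```python
from collections import Counter
from datetime import date

DAILY_LIMIT = 3  # rows per IP per day, per the key assertion


class DemoDialLog:
    """In-memory stand-in for the demo_dial_log table."""

    def __init__(self):
        self._joins = Counter()

    def record_join(self, ip: str, day: date) -> None:
        # Called only when a participant actually joins the room.
        self._joins[(ip, day)] += 1

    def is_blocked(self, ip: str, day: date) -> bool:
        return self._joins[(ip, day)] >= DAILY_LIMIT


def attempt_call(log: DemoDialLog, ip: str, day: date, participant_joined: bool) -> bool:
    """Return True if the attempt was allowed; count it only on a real join."""
    if log.is_blocked(ip, day):
        return False
    if participant_joined:
        log.record_join(ip, day)
    return True
```

Token mints and failed handshakes pass through `attempt_call` with `participant_joined=False`, so they never consume quota; only the fourth attempt after three real joins on the same day is refused.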

§5 — Test Cadence + Ownership

On every PR (gate — blocks merge)

Test	File	What it gates
`voice-tool-schemas.yml` (ruff F821 + AST annotation walker)	`.github/workflows/voice-tool-schemas.yml`	Any PR touching `voice-agent-livekit/**/*.py`; catches P0-1, P0-2
`voice-routing-integration-on-pr.yml` (routing unit + live Supabase)	`.github/workflows/voice-routing-integration-on-pr.yml`	Any PR touching `session.py`, `main.py`, or `verticals/*/integrations/*`; catches FK/RLS regressions
`voice-behavioral-critical-on-pr.yml` (behavioral critical subset)	`.github/workflows/voice-behavioral-critical-on-pr.yml`	Any PR touching voice agent Python code; behavioral smoke
`cold-outreach-director-transfer.yml` (Playwright round-trip)	`.github/workflows/cold-outreach-director-transfer.yml`	PRs touching `src/components/cold-outreach/`, `src/app/api/livekit/token/`, `voice-agent-livekit/core/transfer.py`, `voice-agent-livekit/verticals/*/agents.py` (⚠️ not yet created)
`crisis-pathway gate` (test_transfer_crisis_gate.py)	`voice-agent-livekit/tests/test_transfer_crisis_gate.py`	PRs touching `core/transfer.py`, `safety.py`, `moderation.py`; LIFE-SAFETY, mandatory
`test_escalation_routing.py` (102-msg two-track contract)	`voice-agent-livekit/tests/test_escalation_routing.py`	PRs touching `core/escalation.py`, `safety.py`, `moderation.py`, `verticals/*/prompts.py`; LIFE-SAFETY

Proposed: voice-critical-path-gate workflow — mirrors critical-path-gate.yml logic but specific to voice. Gates all voice-related PRs on passing cold-outreach-director-transfer.spec.ts Playwright artifact AND static contract tests (voice-tool-schemas.yml + test_transfer_sip_payload_shape.py). Applies the existing critical-path-override label escape hatch with a logged reason.

On every voice agent deploy (post-deploy smoke — within 90s of `lk agent deploy`)

lk agent logs --log-type deploy — assert "registered worker" appears within 90s
Manual or scripted call to a demo line — assert agent greets caller (proves dispatch working)
(Future, G5) automated outbound-dial to Telnyx echo number — assert participant joins room within 30s
If any check fails: DO NOT declare deploy successful. Re-run lk agent deploy (livekit/agents#3104 fix pattern). If failure persists after two deploys, escalate to founder. Reference: memory/feedback_livekit_recovery_lk_deploy_only.md.
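
The first check can be scripted as a scan over log lines with a deadline. The sketch below assumes the lines arrive from some iterable, for example the streamed output of the `lk agent logs --log-type deploy` command named above; the function name and the injectable `clock` parameter are illustrative choices, not existing tooling.

```python
import time
from typing import Callable, Iterable


def worker_registered(
    lines: Iterable[str],
    timeout_s: float = 90.0,
    marker: str = "registered worker",
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True if `marker` shows up in `lines` before the deadline passes."""
    deadline = clock() + timeout_s
    for line in lines:
        if clock() > deadline:
            return False  # a line arrived, but too late to count
        if marker in line:
            return True
    return False  # stream ended without the marker
```

One limitation of this shape: the deadline is only checked when a line arrives, so a log stream that goes completely silent would block past 90s. A production version would need a read timeout on the underlying stream as well.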

Daily (crons)

Cron	File	Cadence	What it checks
`cron-voice-health`	`src/app/api/cron/voice-health/route.ts`	Every 15 min	LiveKit inbound trunk config, dispatch rules, agent_name
Telnyx carrier state extension (G4)	extend `voice-health/route.ts`	Every 15 min	Telnyx `outbound_voice_profile_id` bound, DID-to-connection binding ⚠️ not yet implemented
Daily outbound-dial cert (G5)	`src/app/api/cron/voice-outbound-cert/route.ts`	Daily	Real dial to Telnyx echo number, assert room join ⚠️ not yet implemented
`voice-behavioral-nightly-church.yml`	`.github/workflows/voice-behavioral-nightly-church.yml`	Nightly 06:00 UTC	Church vertical behavioral suite (Haiku judge)

Weekly (scheduled)

voice-behavioral-funeral.yml — funeral vertical behavioral scenarios
voice-clients-drift.yml — voice-clients YAML drift detection

Manual (on trigger)

Full 10-item founder-supervised live verification — before any cold-email batch GO/NO-GO
Crisis pathway live test (item 5 in 06-DAY3-HANDOFF.md §6) — call demo line, say crisis phrase, assert 988 routing + DB row + no SMS
Regression across all 4 customer lines (item 6) — verify each answers correctly

Critical-path registry entries (existing, `tests/registry.yaml`)

voice-live-call — critical_path: true, spec_file: null ⚠️ spec not yet authored (the Playwright round-trip G1 will close this)
voice-routing-integration — critical_path: true, spec_file: null — covered by pytest workflow (not Playwright)
voice-behavioral-nightly — critical_path: false, nightly behavioral suite

§6 — Acceptance Criteria — When is the Voice Agent "Hardened"?

The founder uses this checklist to make the cold-email batch GO/NO-GO call. Every item must be provably GREEN before the call. "Provably" means an artifact (PR link, CI run link, file path, SQL query result) that a human or agent can inspect.

§7 — How to Use This Document

Before touching voice code: Read §1 to understand which layer you are working in. Read §2 to know what failure modes have already burned this project in that layer. If your change touches layers that have ⚠️ marks in §3, you must either build the missing test as part of your PR (using §4 priority and file path), or carry a critical-path-override label with a documented reason.

Before opening a PR: Check §3 for every layer your PR touches. If that layer's test status is "static-only" or "⚠️ none," your PR must include the corresponding §4 gap closure OR an explicit waiver. The voice-critical-path-gate workflow (proposed) will enforce this for the highest-priority gaps once implemented.

Before merging a critical-path voice PR: §5 cadence defines which tests must pass. The minimum bar is:

voice-tool-schemas.yml green
voice-routing-integration-on-pr.yml green
cold-outreach-director-transfer.yml green (once spec exists — G1)
No LIFE-SAFETY test failures (test_escalation_routing.py, test_transfer_crisis_gate.py)

Before founder approves a voice-related ship: Walk §6 acceptance criteria. Each item must have an artifact. "Looks good" and "build passes" are not artifacts.

The 8-P0 heuristic: If your PR changes behavior anywhere in Layers 1-11 but its tests exercise only Layers 4-8, and only through stubs, you are shipping another PR #251. The specific question to ask before merge: "Is there at least one test that will fail if I introduce a regression at Layer 1, 2, 3, 9, 10, or 11?" If the answer is no, do not merge.
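
The heuristic reduces to two set checks, sketched here under the assumption that a PR author can list which layers the change touches and which layers have at least one non-stubbed test. Both function names are hypothetical; no such tool exists in the repository.

```python
# Layers named in the pre-merge question: the browser-side and carrier-side edges.
EDGE_LAYERS = frozenset({1, 2, 3, 9, 10, 11})


def uncovered_layers(touched, unstubbed) -> list:
    """Layers the PR touches that no non-stubbed test exercises."""
    return sorted(set(touched) - set(unstubbed))


def passes_8_p0_heuristic(unstubbed) -> bool:
    """True if at least one edge layer has a test that would fail on regression."""
    return bool(EDGE_LAYERS & set(unstubbed))


# PR #251 as described in the preamble: all eleven layers touched, none tested unstubbed.
pr_251_touched = range(1, 12)
pr_251_unstubbed = set()
```

For PR #251, `uncovered_layers` returns every layer from 1 to 11 and `passes_8_p0_heuristic` returns False, which is the "do not merge" answer.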

§8 — Living-Document Protocol

This document is updated whenever:

A new P0 is found in production: Add a row to §2 table with the layer mapping. Immediately assess which §3 entry should have caught it and move to §4 with P0 priority.
A new test lands: Move the corresponding entry from §4 to §3. Update §3 with the file path, "on main" status, and any layer limitations. Remove the ⚠️ from §3 entries the test now covers.
A new layer is added to the stack: Add a definition to §1 (renumber if needed). Add a §3 entry. Add a §4 gap if the layer is untested.
The worktree tests (G2, G3) merge to main: Update §3 Layer 4 + Layer 8 entries to remove the "in worktree" qualifier.
Acceptance criteria (§6) items are met: Check the box and add the artifact link.

Owner: The orchestrator on each Day-N session is responsible for updating this document before ending the session if any §2, §3, or §6 item changed.

Snapshot freshness: The last-verified frontmatter field is updated whenever §3 is re-confirmed against the actual code in voice-agent-livekit/tests/ and .github/workflows/. The current value, last-verified: 2026-04-30, reflects the state of the worktree at the start of Day 4.

The core principle (from memory/feedback_round_trip_test_before_merge.md):

Any PR that ships a customer-facing browser demo, live transfer mechanic, WebRTC↔SIP bridge, or anything where the integration spans browser mic → LiveKit Cloud → agent runtime → STT → LLM → tool call → carrier → callee → bridge MUST include AT LEAST ONE end-to-end Playwright spec that exercises the real round-trip. Stubbed unit tests are insufficient.

This is not a preference. It is the lesson written in 7 hours of live debugging and 8 P0 regressions on a day when the founder said: "The better the demos and the more robust the product, the conversions will be way higher so it's worth spending a few days on it to get it right."
