Triage and Resolve Error Reports
Work through the ops_error_reports table to identify, prioritize, and resolve production errors.
Prerequisites
- Access to the Supabase SQL editor or the `mcp__plugin_supabase_supabase__execute_sql` MCP tool
- Access to Vercel logs for stack traces
- The relevant codebase checked out on a feature branch
Severity Definitions
| Severity | Meaning | SLA |
|---|---|---|
| P0 | Critical — Stripe, chatbot, admin routes down; quota/rate-limit hit | Fix immediately |
| P1 | Important — Non-critical route errors, degraded functionality | Fix same day |
| P2 | Minor — Cosmetic, low-impact errors | Fix in next PR |
P0 routes: /api/stripe/, /api/onboard/, /api/chatbot/stream, /api/admin/
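The severity rules above can be sketched as a small classifier. This is a minimal illustration, not the project's actual implementation: `isP0Route` and `classifySeverity` are hypothetical names, and the two boolean flags are assumptions about how a caller would signal a quota/rate-limit hit or a cosmetic error.

```typescript
type Severity = "P0" | "P1" | "P2";

// P0 route prefixes copied from the list above.
const P0_ROUTE_PREFIXES: readonly string[] = [
  "/api/stripe/",
  "/api/onboard/",
  "/api/chatbot/stream",
  "/api/admin/",
];

// Hypothetical helper: true when the route starts with one of the P0 prefixes.
function isP0Route(route: string): boolean {
  return P0_ROUTE_PREFIXES.some((prefix) => route.indexOf(prefix) === 0);
}

// Hypothetical helper. Assumption: the caller already knows whether the error
// is a quota/rate-limit hit and whether it is purely cosmetic.
function classifySeverity(
  route: string,
  isQuotaOrRateLimit: boolean,
  isCosmetic: boolean
): Severity {
  if (isQuotaOrRateLimit || isP0Route(route)) {
    return "P0"; // fix immediately
  }
  return isCosmetic ? "P2" : "P1"; // next PR vs. same day
}
```

Matching on prefixes with a trailing slash keeps a route such as `/api/administrators` from being treated as an admin route.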
Steps
-
Pull unresolved P0 errors

    SELECT
      id,
      route,
      message,
      message_fingerprint,
      severity,
      created_at,
      count(*) OVER (PARTITION BY message_fingerprint) AS occurrences
    FROM ops_error_reports
    WHERE severity = 'P0'
      AND resolved_at IS NULL
    ORDER BY created_at DESC
    LIMIT 20;

-
Group by fingerprint to find the highest-impact errors

    SELECT
      message_fingerprint,
      route,
      message,
      count(*) AS occurrences,
      min(created_at) AS first_seen,
      max(created_at) AS last_seen
    FROM ops_error_reports
    WHERE resolved_at IS NULL
    GROUP BY message_fingerprint, route, message
    ORDER BY occurrences DESC
    LIMIT 20;

-
Check Vercel logs for the full stack trace

    vercel logs --tail

Or filter to a specific function in Vercel Dashboard → Project → Functions → select the route.
For historical logs, search by time range in Vercel Dashboard → Logs.
-
Diagnose common error patterns

Stripe webhook errors (route: `/api/stripe/webhook`):
  - Signature mismatch → check the `STRIPE_WEBHOOK_SECRET` env var
  - See stripe-webhook-debug.md

Chatbot errors (route: `/api/chatbot/stream`):
  - Rate limit → check `ops_quota_snapshots` for OpenAI/Anthropic quota
  - Embedding failure → check `unified_rag_content` for null embeddings
  - LLM API error → check API key env vars in Vercel

Admin route errors (route: `/api/admin/`):
  - Auth failure → check `church_admin_sessions` for expired sessions
  - DB query timeout → see db-performance.md

Quota/rate-limit errors (any route):

    SELECT service, metric_name, metric_value, recorded_at
    FROM ops_quota_snapshots
    ORDER BY recorded_at DESC
    LIMIT 10;
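The route-specific patterns in this step can be kept as a lookup table so that a triage script prints the first thing to check for a given route. The sketch below is illustrative only: `ROUTE_HINTS` and `hintsForRoute` are hypothetical names, and the entries simply restate the symptom and check pairs listed in this step.

```typescript
interface DiagnosisHint {
  symptom: string;
  firstCheck: string;
}

interface RouteHints {
  routePrefix: string;
  hints: DiagnosisHint[];
}

// Symptom/check pairs restated from the diagnosis step of this runbook.
const ROUTE_HINTS: RouteHints[] = [
  {
    routePrefix: "/api/stripe/webhook",
    hints: [
      { symptom: "Signature mismatch", firstCheck: "STRIPE_WEBHOOK_SECRET env var" },
    ],
  },
  {
    routePrefix: "/api/chatbot/stream",
    hints: [
      { symptom: "Rate limit", firstCheck: "ops_quota_snapshots for OpenAI/Anthropic quota" },
      { symptom: "Embedding failure", firstCheck: "unified_rag_content for null embeddings" },
      { symptom: "LLM API error", firstCheck: "API key env vars in Vercel" },
    ],
  },
  {
    routePrefix: "/api/admin/",
    hints: [
      { symptom: "Auth failure", firstCheck: "church_admin_sessions for expired sessions" },
      { symptom: "DB query timeout", firstCheck: "db-performance.md" },
    ],
  },
];

// Hypothetical helper: returns the hints for the first matching route prefix,
// or an empty list when the route has no documented pattern.
function hintsForRoute(route: string): DiagnosisHint[] {
  const match = ROUTE_HINTS.filter(
    (entry) => route.indexOf(entry.routePrefix) === 0
  )[0];
  return match ? match.hints : [];
}
```

An empty result means the route has no documented pattern yet, so diagnosis falls back to the stack trace from the Vercel logs.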
-
Fix the root cause

Create a feature branch, fix the code, run `pnpm build` to confirm no TypeScript errors, deploy.

For env var issues (no code change needed):

    echo "correct-value" | vercel env add VAR_NAME production
    vercel --prod  # redeploy to pick up new env var

-
Mark errors resolved

After deploying the fix and verifying it works, mark the resolved fingerprint(s):

    UPDATE ops_error_reports
    SET resolved_at = now(),
        resolution_note = 'Fixed by: [brief description]'
    WHERE message_fingerprint = 'abc123def456'
      AND resolved_at IS NULL;

Or mark all errors for a specific route resolved after a confirmed fix:

    -- Only do this when you are CERTAIN the fix covers all errors for this route
    UPDATE ops_error_reports
    SET resolved_at = now(),
        resolution_note = 'Deployed fix for [issue] on 2026-03-25'
    WHERE route = '/api/chatbot/stream'
      AND severity = 'P1'
      AND resolved_at IS NULL
      AND created_at < '2026-03-25T12:00:00Z';

-
Check `ops_quota_snapshots` for service degradation

If errors correlate with a service quota breach:

    -- Adjust column names per actual schema
    SELECT service, metric_name, metric_value, threshold, recorded_at
    FROM ops_quota_snapshots
    WHERE metric_value > threshold
    ORDER BY recorded_at DESC
    LIMIT 10;

For Twilio balance issues: top up the account before retrying failed voice calls.
For Resend quota: check plan limits and upgrade if needed.
Verification
- `SELECT count(*) FROM ops_error_reports WHERE severity='P0' AND resolved_at IS NULL` returns 0
- The fixed route returns 200 responses in Vercel logs
- No new P0 errors appear within 15 minutes of the fix (one monitoring cycle)
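Two of these checks (no unresolved P0 rows, and no new P0 errors within one monitoring cycle) can be evaluated mechanically once the rows are fetched. The following is a minimal sketch under stated assumptions: `verifyFix`, `ErrorRow`, and `Verdict` are hypothetical names, the row shape mirrors only the `ops_error_reports` columns used in this runbook, and timestamps are assumed to be ISO 8601 strings. The 200-response check still has to be done in the Vercel logs.

```typescript
// One monitoring cycle, per the verification checklist above.
const MONITORING_CYCLE_MS = 15 * 60 * 1000;

// Assumed subset of ops_error_reports columns, with ISO 8601 timestamps.
interface ErrorRow {
  severity: string;
  created_at: string;
  resolved_at: string | null;
}

type Verdict = "pass" | "fail" | "wait";

// Hypothetical helper: "fail" if any P0 row is unresolved or was created
// during the monitoring window after the deploy, "wait" if the window has
// not elapsed yet, otherwise "pass".
function verifyFix(rows: ErrorRow[], fixDeployedAt: Date, now: Date): Verdict {
  const deployedMs = fixDeployedAt.getTime();
  const windowEndMs = deployedMs + MONITORING_CYCLE_MS;
  const p0Rows = rows.filter((row) => row.severity === "P0");

  const hasNewP0 = p0Rows.some((row) => {
    const createdMs = Date.parse(row.created_at);
    return createdMs >= deployedMs && createdMs <= windowEndMs;
  });
  const hasUnresolvedP0 = p0Rows.some((row) => row.resolved_at === null);

  if (hasNewP0 || hasUnresolvedP0) {
    return "fail";
  }
  return now.getTime() < windowEndMs ? "wait" : "pass";
}
```

Returning "wait" separately from "pass" keeps a clean result from being reported before the full 15-minute window has elapsed.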