Triage and Resolve Error Reports

Work through the ops_error_reports table to identify, prioritize, and resolve production errors.

Prerequisites

  • Access to Supabase SQL editor or mcp__plugin_supabase_supabase__execute_sql MCP tool
  • Access to Vercel logs for stack traces
  • The relevant codebase checked out on a feature branch

Severity Definitions

Severity | Meaning                                                                | SLA
P0       | Critical: Stripe, chatbot, admin routes down; quota/rate-limit hit     | Fix immediately
P1       | Important: non-critical route errors, degraded functionality           | Fix same day
P2       | Minor: cosmetic, low-impact errors                                     | Fix in next PR

P0 routes: /api/stripe/, /api/onboard/, /api/chatbot/stream, /api/admin/
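
The P0 route list above lends itself to a simple prefix check when you need to sort a batch of routes by severity. The following is a minimal sketch, not part of the runbook's tooling: the function name `is_p0_route` is hypothetical, and the prefixes are copied from the list above.

```python
# Prefixes copied from the P0 route list; the helper itself is a
# hypothetical illustration, not an existing function in the codebase.
P0_ROUTE_PREFIXES = (
    "/api/stripe/",
    "/api/onboard/",
    "/api/chatbot/stream",
    "/api/admin/",
)


def is_p0_route(route: str) -> bool:
    """Return True when the route starts with one of the listed P0 prefixes."""
    # str.startswith accepts a tuple and matches any of its entries.
    return route.startswith(P0_ROUTE_PREFIXES)


if __name__ == "__main__":
    for route in ("/api/stripe/webhook", "/api/health"):
        print(route, "P0" if is_p0_route(route) else "not P0")
```

Quota and rate-limit errors count as P0 on any route, so a route-based check like this covers only part of the P0 definition.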

Steps

  1. Pull unresolved P0 errors

    SELECT
      id,
      route,
      message,
      message_fingerprint,
      severity,
      created_at,
      -- the window count spans every unresolved P0 row, not just the 20 returned
      count(*) OVER (PARTITION BY message_fingerprint) AS occurrences
    FROM ops_error_reports
    WHERE severity = 'P0'
      AND resolved_at IS NULL
    ORDER BY created_at DESC
    LIMIT 20;
  2. Group by fingerprint to find the highest-impact errors

    SELECT
      message_fingerprint,
      route,
      message,
      count(*) AS occurrences,
      min(created_at) AS first_seen,
      max(created_at) AS last_seen
    FROM ops_error_reports
    WHERE resolved_at IS NULL
    GROUP BY message_fingerprint, route, message
    ORDER BY occurrences DESC
    LIMIT 20;
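
If you are working from exported rows rather than the SQL editor, the same grouping can be reproduced in memory. This is a rough sketch that mirrors the query above; the function name `group_by_fingerprint` and the dict-shaped rows are assumptions for illustration, and timestamps are assumed to be ISO-8601 strings in one consistent format so that they sort correctly as text.

```python
from collections import defaultdict


def group_by_fingerprint(rows):
    """Aggregate unresolved rows per (fingerprint, route, message), most frequent first."""
    groups = defaultdict(
        lambda: {"occurrences": 0, "first_seen": None, "last_seen": None}
    )
    for row in rows:
        if row.get("resolved_at") is not None:
            continue  # mirrors WHERE resolved_at IS NULL
        key = (row["message_fingerprint"], row["route"], row["message"])
        group = groups[key]
        group["occurrences"] += 1
        ts = row["created_at"]
        group["first_seen"] = ts if group["first_seen"] is None else min(group["first_seen"], ts)
        group["last_seen"] = ts if group["last_seen"] is None else max(group["last_seen"], ts)
    # mirrors ORDER BY occurrences DESC
    ranked = sorted(groups.items(), key=lambda kv: kv[1]["occurrences"], reverse=True)
    return [
        {"message_fingerprint": k[0], "route": k[1], "message": k[2], **v}
        for k, v in ranked
    ]


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"message_fingerprint": "a", "route": "/api/x", "message": "boom",
     "created_at": "2026-03-25T10:00:00Z", "resolved_at": None},
    {"message_fingerprint": "a", "route": "/api/x", "message": "boom",
     "created_at": "2026-03-25T11:00:00Z", "resolved_at": None},
    {"message_fingerprint": "b", "route": "/api/y", "message": "oops",
     "created_at": "2026-03-25T09:00:00Z", "resolved_at": None},
    {"message_fingerprint": "b", "route": "/api/y", "message": "oops",
     "created_at": "2026-03-25T09:30:00Z", "resolved_at": "2026-03-25T09:45:00Z"},
]
ranked_groups = group_by_fingerprint(sample_rows)
```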
  3. Check Vercel logs for the full stack trace

    vercel logs --tail

    Or filter to a specific function in Vercel Dashboard → Project → Functions → select the route.

    For historical logs, search by time range in Vercel Dashboard → Logs.

  4. Diagnose common error patterns

    Stripe webhook errors (route: /api/stripe/webhook):

    Chatbot errors (route: /api/chatbot/stream):

    • Rate limit → check ops_quota_snapshots for OpenAI/Anthropic quota
    • Embedding failure → check unified_rag_content for null embeddings
    • LLM API error → check API key env vars in Vercel

    Admin route errors (route: /api/admin/):

    • Auth failure → check church_admin_sessions for expired sessions
    • DB query timeout → see db-performance.md

    Quota/rate-limit errors (any route):

    SELECT service, metric_name, metric_value, recorded_at
    FROM ops_quota_snapshots
    ORDER BY recorded_at DESC LIMIT 10;
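
The chatbot and admin patterns listed in this step can be captured as a small lookup table that suggests where to look first. This is only a sketch: the keyword strings are guesses at what the error messages might contain, and `suggest_check` is a hypothetical name. The hints themselves are taken from the bullets above.

```python
# (route prefix, lowercase keyword, hint). Keywords are assumptions about
# message wording; the hints are copied from the runbook bullets.
DIAGNOSTIC_HINTS = [
    ("/api/chatbot/stream", "rate limit",
     "check ops_quota_snapshots for OpenAI/Anthropic quota"),
    ("/api/chatbot/stream", "embedding",
     "check unified_rag_content for null embeddings"),
    ("/api/chatbot/stream", "api error",
     "check API key env vars in Vercel"),
    ("/api/admin/", "auth",
     "check church_admin_sessions for expired sessions"),
    ("/api/admin/", "timeout",
     "see db-performance.md"),
]


def suggest_check(route, message):
    """Return the first matching hint for a route and message, or None."""
    lowered = message.lower()
    for prefix, keyword, hint in DIAGNOSTIC_HINTS:
        if route.startswith(prefix) and keyword in lowered:
            return hint
    return None


if __name__ == "__main__":
    print(suggest_check("/api/chatbot/stream", "OpenAI rate limit exceeded"))
```

Entries are matched in order, so more specific keywords should be listed before broader ones.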
  5. Fix the root cause

    Create a feature branch, fix the code, run pnpm build to confirm there are no TypeScript errors, then deploy.

    For env var issues (no code change needed):

    # printf avoids the trailing newline that echo would append to the value
    printf '%s' "correct-value" | vercel env add VAR_NAME production
    vercel --prod # redeploy to pick up new env var
  6. Mark errors resolved

    After deploying the fix and verifying that it works, mark the fixed fingerprint(s) as resolved:

    UPDATE ops_error_reports
    SET resolved_at = now(), resolution_note = 'Fixed by: [brief description]'
    WHERE message_fingerprint = 'abc123def456'
    AND resolved_at IS NULL;
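
Because the UPDATE above only touches rows where resolved_at IS NULL, running it a second time is harmless. The sketch below demonstrates this with an in-memory SQLite database as a stand-in for the real Postgres table. The reduced table layout, the `mark_resolved` helper, and the use of `datetime('now')` in place of `now()` are all assumptions made for the demonstration.

```python
import sqlite3


def mark_resolved(conn, fingerprint, note):
    """Mark unresolved rows for one fingerprint as resolved; return rows changed."""
    cur = conn.execute(
        "UPDATE ops_error_reports "
        "SET resolved_at = datetime('now'), resolution_note = ? "
        "WHERE message_fingerprint = ? AND resolved_at IS NULL",
        (note, fingerprint),  # bound parameters, never string formatting
    )
    conn.commit()
    return cur.rowcount


# In-memory stand-in with only the columns the UPDATE needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ops_error_reports ("
    "id INTEGER PRIMARY KEY, message_fingerprint TEXT, "
    "resolved_at TEXT, resolution_note TEXT)"
)
conn.executemany(
    "INSERT INTO ops_error_reports (message_fingerprint) VALUES (?)",
    [("abc123def456",), ("abc123def456",), ("other",)],
)

first_run = mark_resolved(conn, "abc123def456", "Fixed by: example")
second_run = mark_resolved(conn, "abc123def456", "Fixed by: example")
still_open = conn.execute(
    "SELECT count(*) FROM ops_error_reports WHERE resolved_at IS NULL"
).fetchone()[0]
```

The second call changes nothing because no matching row is still unresolved, and rows with other fingerprints stay open.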

    Or mark all errors for a specific route resolved after a confirmed fix:

    -- Only do this when you are CERTAIN the fix covers all errors for this route
    UPDATE ops_error_reports
    SET resolved_at = now(), resolution_note = 'Deployed fix for [issue] on 2026-03-25'
    WHERE route = '/api/chatbot/stream'
    AND severity = 'P1'
    AND resolved_at IS NULL
    AND created_at < '2026-03-25T12:00:00Z';
  7. Check ops_quota_snapshots for service degradation

    If errors correlate with a service quota breach:

    SELECT service, metric_name, metric_value, threshold, recorded_at
    FROM ops_quota_snapshots
    WHERE metric_value > threshold -- adjust column names per actual schema
    ORDER BY recorded_at DESC LIMIT 10;

    For Twilio balance issues: top up the account before retrying failed voice calls. For Resend quota: check plan limits and upgrade if needed.

Verification

  • SELECT count(*) FROM ops_error_reports WHERE severity='P0' AND resolved_at IS NULL returns 0
  • The fixed route returns 200 responses in Vercel logs
  • No new P0 errors appear within 15 minutes of the fix (one monitoring cycle)
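
The last check can be expressed as a small predicate over timestamps. This is a minimal sketch assuming timezone-aware datetimes; `fix_is_verified` and its parameters are hypothetical names, and the 15-minute window is the monitoring cycle stated above.

```python
from datetime import datetime, timedelta, timezone

MONITORING_CYCLE = timedelta(minutes=15)  # one monitoring cycle, per the runbook


def fix_is_verified(fix_deployed_at, p0_error_times, now):
    """True once a full cycle has passed with no P0 error created after the fix."""
    if now - fix_deployed_at < MONITORING_CYCLE:
        return False  # too early to tell
    window_end = fix_deployed_at + MONITORING_CYCLE
    # Errors created before the deploy do not count against the fix.
    return not any(fix_deployed_at < t <= window_end for t in p0_error_times)


if __name__ == "__main__":
    deployed = datetime(2026, 3, 25, 12, 0, tzinfo=timezone.utc)
    print(fix_is_verified(deployed, [], deployed + timedelta(minutes=20)))
```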

See Also