PewSearch Data Quality

Why Data Quality Matters

PewSearch displays 218K+ church listings to the public. Every listing represents a real church where real people worship. Data quality directly affects:

  • User trust: Wrong addresses, closed churches, or restaurant photos destroy credibility
  • SEO authority: Google penalizes directories with stale or inaccurate information
  • Conversion: A pastor who sees their church listed with wrong hours will not claim the listing
  • Downstream products: Voice Agent and Chatbot inherit church data -- bad data means bad AI responses

This document catalogs known data quality issues, their scope, and the mitigation strategies in place.

Known Data Quality Issues

1. Address-Only-State Records (~128K rows)

Severity: High
Scope: ~128K churches have an address containing only the state name or abbreviation (e.g., "TX", "California")

These records were imported from OpenStreetMap where the address field was populated with only the state. The church exists and has a valid name, coordinates, and often a denomination, but the address is unusable for display or directions.

Detection:

-- Count address-only-state records
SELECT COUNT(*) FROM churches
WHERE directory_visible = true
  AND business_status = 'OPERATIONAL'
  AND (
    LENGTH(address) <= 3
    OR address = state
    OR address = state_code
  );

Mitigation: address-utils.ts contains isDisplayableAddress():

pseudocode: isDisplayableAddress(address, state, state_code)
    if address is null or empty:
        return false
    if address.trim().length <= 3:
        return false   // "TX", "CA", etc.
    if address.trim() == state or address.trim() == state_code:
        return false   // "Texas", "TX"
    if address matches pattern /^[A-Z]{2}$/i:
        return false   // Any two-letter abbreviation (already caught by the length check above)
    return true

When isDisplayableAddress() returns false, the UI shows "Address not available" instead of the misleading state-only value. The church still appears in search results (it has valid coordinates for map display).
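
The check and its UI fallback can be sketched in TypeScript as follows. This is a minimal illustration, not the contents of address-utils.ts: the case-insensitive comparison and the addressLabel() helper are assumptions added here, and the two-letter pattern is omitted because the length check already covers it.

```typescript
// Sketch only; the real isDisplayableAddress() in address-utils.ts may differ.
function isDisplayableAddress(
  address: string | null | undefined,
  state?: string | null,
  stateCode?: string | null,
): boolean {
  const trimmed = (address ?? "").trim();
  // Null, empty, and very short values ("TX", "CA") are never displayable.
  if (trimmed.length <= 3) return false;
  // Assumption: compare case-insensitively so "texas" is caught as well as "Texas".
  const lower = trimmed.toLowerCase();
  if (state && lower === state.trim().toLowerCase()) return false;
  if (stateCode && lower === stateCode.trim().toLowerCase()) return false;
  return true;
}

// Hypothetical helper: the text the detail page would render.
function addressLabel(address: string | null, state: string, stateCode: string): string {
  return isDisplayableAddress(address, state, stateCode)
    ? (address ?? "").trim()
    : "Address not available";
}
```

A state-only value such as "California" falls back to the placeholder, while a full street address is returned trimmed.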

2. Non-Church Business Photos

Severity: Medium
Scope: Unknown (estimated hundreds to low thousands)

Google Maps data sometimes associates photos from nearby businesses with church listings. Known cases include restaurant interiors, park landscapes, dispensary storefronts, and gas station signs appearing as the photo_url for churches.

Root cause: Outscraper/Google Maps API returns the most prominent photo for a Google Maps place ID. When the place ID is slightly wrong or the business has been recategorized, the photo may be from a different business.

Detection: Manual review only. No automated detection is in place.

Mitigation strategies:

  • Premium churches upload their own photos (overrides scraped photo)
  • Category-based filtering during import (exclude categories like "restaurant", "gas_station")
  • Community reports via contact form
  • Future: AI-based photo classification to flag non-church images

3. Missing Service Hours (~20% of listings)

Severity: Medium
Scope: ~44K visible churches have NULL or empty working_hours

Many churches -- especially smaller congregations and non-denominational churches -- do not have hours listed on Google Maps. OpenStreetMap rarely includes hours data.

Impact: The church detail page shows "Hours not available" and cannot display a "Next Service" highlight. This reduces the page's usefulness and SEO value.

Mitigation:

  • Premium churches set their own hours via admin dashboard (custom_hours)
  • Website scraping extracts hours from church websites when available
  • Google Maps data refresh periodically adds hours for churches that update their Google listing
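
How the detail page might choose which hours to show can be sketched as below. The precedence (custom_hours first, then working_hours) and the string type for both fields are assumptions; the source only states that Premium churches set custom_hours and that the page shows "Hours not available" when no hours exist.

```typescript
// Sketch only: field types and precedence are assumed, not taken from the schema.
interface ChurchHours {
  custom_hours: string | null;  // set by Premium churches in the admin dashboard
  working_hours: string | null; // from Google Maps or website scraping
}

function displayHours(church: ChurchHours): string {
  const candidates = [church.custom_hours, church.working_hours];
  // Pick the first value that is neither null nor blank.
  const chosen = candidates.find((h) => h !== null && h.trim() !== "");
  return chosen ?? "Hours not available";
}
```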

4. Missing Denominations (~15% of listings)

Severity: Low-Medium
Scope: ~33K visible churches have NULL denomination

Non-denominational churches intentionally omit denomination, but many denominational churches also have NULL denomination due to incomplete data sources.

Impact: These churches do not appear in denomination-filtered searches. They can still be found by name, location, or text search.

Mitigation:

  • Website scraping attempts to extract denomination from "about" pages
  • Name-to-denomination heuristics (e.g., "First Baptist Church of Dallas" → "Baptist")
  • Premium churches set denomination during claim flow
  • Community submissions
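
A name-based heuristic of the kind listed above could look like the following sketch. The keyword table and the function name inferDenominationFromName() are hypothetical; a production list would be longer and curated.

```typescript
// Hypothetical keyword table mapping name patterns to denominations.
const DENOMINATION_KEYWORDS: Array<[RegExp, string]> = [
  [/\bbaptist\b/i, "Baptist"],
  [/\bmethodist\b/i, "Methodist"],
  [/\blutheran\b/i, "Lutheran"],
  [/\bpresbyterian\b/i, "Presbyterian"],
  [/\bcatholic\b/i, "Catholic"],
];

function inferDenominationFromName(name: string): string | null {
  for (const [pattern, denomination] of DENOMINATION_KEYWORDS) {
    if (pattern.test(name)) return denomination;
  }
  // No keyword matched: leave the denomination unset rather than guess.
  return null;
}
```

Returning null for unmatched names keeps non-denominational churches from being mislabeled.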

5. Duplicate Entries

Severity: Medium
Scope: Estimated 2-5K duplicate pairs

The same physical church can appear multiple times under:

  • Different names ("Grace Baptist Church" vs "Grace Baptist")
  • Different data sources (one from OSM, one from Google Maps)
  • Name changes (old name still in database alongside new name)
  • Multi-campus churches (main campus and satellite listed separately)

Detection:

-- Find potential duplicates by proximity + similar name.
-- similarity() comes from the pg_trgm extension; haversine_distance() is not a
-- PostgreSQL built-in and is assumed to be a user-defined function returning miles.
SELECT a.id AS id_a, a.name AS name_a, b.id AS id_b, b.name AS name_b,
       haversine_distance(a.latitude, a.longitude, b.latitude, b.longitude) AS dist_miles
FROM churches a
JOIN churches b ON a.id < b.id
WHERE a.directory_visible = true
  AND b.directory_visible = true
  AND haversine_distance(a.latitude, a.longitude, b.latitude, b.longitude) < 0.1
  AND similarity(a.name, b.name) > 0.6
LIMIT 100;
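
The same proximity-plus-name test can be sketched in application code, for example to spot-check a candidate pair outside the database. This is an illustration under assumptions: nameSimilarity() is a simplified trigram overlap in the spirit of pg_trgm, not a reimplementation of it, and all function names here are hypothetical.

```typescript
const EARTH_RADIUS_MILES = 3958.8;

// Great-circle distance between two points, in miles.
function haversineMiles(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_MILES * Math.asin(Math.min(1, Math.sqrt(a)));
}

// Set of three-character windows over each lowercased, space-padded word.
function trigrams(text: string): Set<string> {
  const result = new Set<string>();
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
  for (const word of words) {
    const padded = "  " + word + " ";
    for (let i = 0; i + 3 <= padded.length; i++) {
      result.add(padded.slice(i, i + 3));
    }
  }
  return result;
}

// Shared trigrams divided by total distinct trigrams (0 to 1).
function nameSimilarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

interface Located {
  name: string;
  latitude: number;
  longitude: number;
}

// Same thresholds as the query above: under 0.1 miles apart, similarity over 0.6.
function isLikelyDuplicate(a: Located, b: Located): boolean {
  const dist = haversineMiles(a.latitude, a.longitude, b.latitude, b.longitude);
  return dist < 0.1 && nameSimilarity(a.name, b.name) > 0.6;
}
```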

Mitigation:

  • Deduplication script runs periodically (merge lower-quality record into higher-quality)
  • directory_visible = false hides identified duplicates without deleting data
  • Manual review for high-profile churches

6. Permanently Closed Churches

Severity: Low
Scope: Tracked via business_status field

Churches that have permanently closed are marked business_status = 'CLOSED_PERMANENTLY'. The mandatory query filter business_status = 'OPERATIONAL' excludes these from all directory views.

Detection: Google Maps periodically marks businesses as closed. Our data refresh picks up these status changes.

Mitigation: Already handled by the mandatory query filter. No user-visible impact.

7. Incorrect Coordinates

Severity: Low-Medium
Scope: Unknown (estimated <1%)

Some churches have latitude/longitude that places them in the wrong location (sometimes in a different state). This affects map display and "nearby churches" results.

Detection:

-- Find churches where coordinates don't match their state.
-- ST_MakePoint returns SRID 0, so the SRID is set to match us_states.geom (assumed here to be 4326).
SELECT id, name, state_code, latitude, longitude
FROM churches
WHERE directory_visible = true
  AND latitude IS NOT NULL
  AND longitude IS NOT NULL
  AND NOT ST_Contains(
    (SELECT s.geom FROM us_states s WHERE s.state_code = churches.state_code),
    ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)
  )
LIMIT 50;

Mitigation:

  • Cross-reference coordinates with state boundaries
  • Geocoding validation during import
  • Premium churches can correct their coordinates via admin

Import Pipeline Quality Gates

When new data is imported (from any source), the following quality gates apply:

pseudocode: importQualityGates(church_record)
    // Gate 1: Required fields
    REQUIRE: name is not null and not empty
    REQUIRE: state_code is valid US state/territory
    REQUIRE: latitude and longitude are within US bounds
        latitude: 18.0 to 72.0 (covers the 50 states and nearly all of Puerto Rico)
        longitude: -180.0 to -65.0 (covers Alaska east of the 180° meridian)
        // Note: Guam, the Northern Mariana Islands, American Samoa, the U.S. Virgin
        // Islands, and the Aleutians west of the 180° meridian fall outside these bounds.

    // Gate 2: Category filtering
    EXCLUDE if category in:
        ["restaurant", "gas_station", "convenience_store",
         "bar", "liquor_store", "cannabis_dispensary",
         "night_club", "adult_entertainment", "casino",
         "pawn_shop", "tattoo_parlor"]

    // Gate 3: Name filtering
    EXCLUDE if name matches patterns:
        /mosque|synagogue|temple|gurdwara|masjid/i
        (PewSearch is churches-only; other faith directories are separate)
        // Note: "temple" also matches Christian names such as "Temple Baptist Church".

    // Gate 4: Deduplication check
    CHECK for existing church with:
        same state_code AND
        (similar name OR same coordinates within 0.05 miles)
    If duplicate found: merge data (keep higher-quality fields), do not insert

    // Gate 5: Default values
    SET directory_visible = true
    SET business_status = 'OPERATIONAL'

    return PASS / FAIL with reason
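
Gates 1 through 3 are pure checks on a single record and can be sketched in TypeScript as below. The ChurchRecord and GateResult types, the abbreviated VALID_STATE_CODES set, and the reason strings are all hypothetical; Gate 4 (deduplication) needs database access and Gate 5 only sets defaults, so both are left out of this sketch.

```typescript
interface ChurchRecord {
  name: string | null;
  state_code: string | null;
  latitude: number | null;
  longitude: number | null;
  category?: string | null;
}

interface GateResult {
  pass: boolean;
  reason?: string;
}

// Abbreviated, hypothetical list; a real gate would cover every state and territory.
const VALID_STATE_CODES = new Set(["TX", "CA", "NY", "FL", "PR"]);

const EXCLUDED_CATEGORIES = new Set([
  "restaurant", "gas_station", "convenience_store",
  "bar", "liquor_store", "cannabis_dispensary",
  "night_club", "adult_entertainment", "casino",
  "pawn_shop", "tattoo_parlor",
]);

// Pattern copied from the gate description; note that it also matches "Temple Baptist Church".
const NON_CHURCH_NAME = /mosque|synagogue|temple|gurdwara|masjid/i;

function importQualityGates(record: ChurchRecord): GateResult {
  const fail = (reason: string): GateResult => ({ pass: false, reason });
  const { name, state_code, latitude, longitude, category } = record;

  // Gate 1: required fields and coordinate bounds.
  if (!name || name.trim() === "") return fail("missing name");
  if (!state_code || !VALID_STATE_CODES.has(state_code)) return fail("invalid state_code");
  if (latitude === null || longitude === null) return fail("missing coordinates");
  if (!Number.isFinite(latitude) || !Number.isFinite(longitude)) return fail("invalid coordinates");
  if (latitude < 18.0 || latitude > 72.0 || longitude < -180.0 || longitude > -65.0) {
    return fail("coordinates outside US bounds");
  }

  // Gate 2: category filtering.
  if (category && EXCLUDED_CATEGORIES.has(category)) return fail("excluded category");

  // Gate 3: name filtering.
  if (NON_CHURCH_NAME.test(name)) return fail("non-church name");

  return { pass: true };
}
```

Returning a reason alongside the failure makes rejected rows easy to audit after an import run.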

Content Enrichment Pipeline

The enrichment pipeline improves existing records by scraping church websites:

pseudocode: enrichChurch(church)
    if church.website is null:
        return   // Nothing to scrape

    if church.website_scraped_at is recent (< 90 days):
        return   // Already enriched recently

    // Scrape website
    content = scrapeWebsite(church.website)

    // Extract structured data
    extractedHours = parseHours(content)
    extractedStaff = parseStaff(content)   // extracted, but not written back in the steps below
    extractedDenomination = parseDenomination(content)
    extractedDescription = parseDescription(content)

    // Update church record (only fill gaps, never overwrite existing good data)
    if church.working_hours is null AND extractedHours is valid:
        UPDATE church SET working_hours = extractedHours

    if church.denomination is null AND extractedDenomination is valid:
        UPDATE church SET denomination = extractedDenomination

    if church.description is null AND extractedDescription is valid:
        UPDATE church SET description = extractedDescription

    UPDATE church SET website_scraped_at = now()
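
The "fill gaps, never overwrite" rule and the 90-day freshness check can be sketched as two pure functions. The EnrichableChurch type, the function names, and the use of plain strings for every field are assumptions made for illustration; scraping, validation of extracted values, and the database update are out of scope here.

```typescript
interface EnrichableChurch {
  website: string | null;
  website_scraped_at: Date | null;
  working_hours: string | null;
  denomination: string | null;
  description: string | null;
}

// Values parsed from the website; assumed to be validated before they reach fillGaps().
interface ExtractedFields {
  working_hours?: string | null;
  denomination?: string | null;
  description?: string | null;
}

const RESCRAPE_AFTER_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when the church has a website that has not been scraped in the last 90 days.
function needsEnrichment(church: EnrichableChurch, now: Date): boolean {
  if (!church.website) return false;
  if (!church.website_scraped_at) return true;
  const ageDays = (now.getTime() - church.website_scraped_at.getTime()) / MS_PER_DAY;
  return ageDays >= RESCRAPE_AFTER_DAYS;
}

// Existing values always win; extracted values only fill null fields.
function fillGaps(church: EnrichableChurch, extracted: ExtractedFields, now: Date): EnrichableChurch {
  return {
    ...church,
    working_hours: church.working_hours ?? extracted.working_hours ?? null,
    denomination: church.denomination ?? extracted.denomination ?? null,
    description: church.description ?? extracted.description ?? null,
    website_scraped_at: now,
  };
}
```

Keeping the merge pure makes it easy to test that existing data is never overwritten, independent of the scraper.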

Scraper Exclusion Rules

Certain websites and patterns are excluded from scraping, and requests to the remaining sites are rate limited (see memory: feedback_scraper_exclusions):

  • Websites behind authentication walls
  • Websites with robots.txt disallow
  • Websites that return CAPTCHA or anti-bot pages
  • Known template platforms that don't contain unique content
  • Rate limiting: max 1 request per second per domain
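
The per-domain rate limit can be sketched as a small scheduler that tells the caller how long to wait before the next request. The DomainThrottle class and its reserve() method are hypothetical names; the only value taken from the rules above is the one-second minimum interval.

```typescript
// Sketch of a per-domain throttle: at most one request per interval per domain.
class DomainThrottle {
  private nextAllowed = new Map<string, number>();
  private minIntervalMs: number;

  constructor(minIntervalMs: number = 1000) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserves the next free slot for `domain` and returns how many
  // milliseconds the caller should wait before sending the request.
  reserve(domain: string, nowMs: number): number {
    const earliest = Math.max(nowMs, this.nextAllowed.get(domain) ?? 0);
    this.nextAllowed.set(domain, earliest + this.minIntervalMs);
    return earliest - nowMs;
  }
}
```

Passing the current time in as an argument keeps the class deterministic, so the spacing logic can be tested without real delays.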

Data Quality Metrics

Key metrics to monitor (queryable from Supabase):

| Metric | Query | Target |
| --- | --- | --- |
| Total visible churches | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND business_status='OPERATIONAL' | 218K+ |
| Missing address | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND (address IS NULL OR LENGTH(TRIM(address)) <= 3 OR TRIM(address) = state OR TRIM(address) = state_code) | < 130K |
| Missing hours | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND working_hours IS NULL | < 45K |
| Missing denomination | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND denomination IS NULL | < 35K |
| Missing photo | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND photo_url IS NULL | < 100K |
| Missing phone | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND phone IS NULL | < 80K |
| Missing website | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND website IS NULL | < 100K |
| Has coordinates | SELECT COUNT(*) FROM churches WHERE directory_visible=true AND latitude IS NOT NULL | > 200K |

The "Missing address" query spells out the isDisplayableAddress() rules in SQL, because that function lives in address-utils.ts and cannot be called from a database query.
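
Once the counts are collected, comparing them against the targets is mechanical. The sketch below assumes the counts arrive as a Map keyed by metric name; the MetricTarget type and findViolations() are hypothetical, the thresholds are copied from the Target column, and "218K+" is read as "at least 218,000".

```typescript
type Comparator = "<" | ">" | ">=";

interface MetricTarget {
  metric: string;
  comparator: Comparator;
  threshold: number;
}

// Thresholds copied from the metrics table.
const METRIC_TARGETS: MetricTarget[] = [
  { metric: "Total visible churches", comparator: ">=", threshold: 218000 },
  { metric: "Missing address", comparator: "<", threshold: 130000 },
  { metric: "Missing hours", comparator: "<", threshold: 45000 },
  { metric: "Missing denomination", comparator: "<", threshold: 35000 },
  { metric: "Missing photo", comparator: "<", threshold: 100000 },
  { metric: "Missing phone", comparator: "<", threshold: 80000 },
  { metric: "Missing website", comparator: "<", threshold: 100000 },
  { metric: "Has coordinates", comparator: ">", threshold: 200000 },
];

function meetsTarget(count: number, target: MetricTarget): boolean {
  if (target.comparator === "<") return count < target.threshold;
  if (target.comparator === ">") return count > target.threshold;
  return count >= target.threshold;
}

// Returns the names of metrics whose measured count misses its target.
// Metrics with no measured count are skipped rather than reported.
function findViolations(counts: Map<string, number>): string[] {
  const violations: string[] = [];
  for (const target of METRIC_TARGETS) {
    const count = counts.get(target.metric);
    if (count !== undefined && !meetsTarget(count, target)) {
      violations.push(target.metric);
    }
  }
  return violations;
}
```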

Prioritization: What to Fix First

When allocating resources to data quality improvements:

| Priority | Issue | Rationale |
| --- | --- | --- |
| P0 | Closed churches still visible | Destroys trust immediately |
| P1 | Non-church businesses in directory | Misleading, hurts SEO authority |
| P1 | Premium church data inaccurate | Paying customers see wrong info |
| P2 | Missing hours for high-traffic churches | Reduces conversion for best candidates |
| P2 | Duplicate entries in same city | Confusing search results |
| P3 | Address-only-state display | Mitigated by isDisplayableAddress() |
| P3 | Missing denomination | Partial searches still work |
| P4 | Missing photos | Nice-to-have, not trust-breaking |

See Also