Content Generation Pipeline

How sermon illustrations are generated, enriched, embedded, and surfaced in the IllustrateTheWord directory (327K+ records in unified_rag_content).

Two Content Paths

Content enters unified_rag_content through two paths: public-domain scraping and AI generation. Both converge on the same insertion and embedding pipeline.

Path A: Public-Domain Scraping (Python)

Historical illustrations are scraped from Archive.org sources (Biblical Illustrator, Spurgeon, Maclaren). Text cleanup runs through the claude -p CLI, NOT through API calls.

1. SCRAPER SELECTS SOURCE
   scraper = BiblicalIllustratorScraper(source_config)
   # source_config defines: author, pub_year, content_type, theological_lens_id

2. PARSE VOLUME
   for item in scraper.parse_volume(volume_id):
       # item is a ScrapedItem: raw_text, book, chapter, verse, verse_quote
       yield item

3. AI CLEANUP (Claude CLI — NOT API)
   cleaned = scraper._call_claude_cli(prompt)
   # Spawns: claude -p "Clean up this OCR text..."
   # ENV VARS STRIPPED: CLAUDE_CODE_ENTRYPOINT, CLAUDECODE deleted
   #   (required for nested CLI invocation from scripts)
   # COST: $0 — founder pays $200/mo for Claude Max

4. BUILD ProcessedItem
   item = ProcessedItem(
       id=uuid4(),
       slug=slugify(title) + "-" + str(id)[:8],  # UUID must be stringified before slicing
       title, content, summary, teaser,
       word_count=len(content.split()),
       content_type="historical_illustration",
       source_type="ai_generated",
       scripture_references=[make_scripture_ref(book, ch, v_start, v_end)],
       theological_lens_id=0,  # Universal (shows in all traditions)
       is_universal=True,
       topics, themes,
       primary_author, primary_source, source_attribution,
       quality_score=simple_quality_score(content, refs),
       visibility_tier="free_signup",
       curation_status="approved",
   )

5. QUALITY SCORING
   score = simple_quality_score(content, refs)
   # Baseline: 0.70
   # Penalties: word_count < 80 (-0.25), banned phrases (-0.15),
   #   God names lowercase (-0.10)
   # Bonuses: 150-280 words (+0.05), 5+ proper nouns (+0.05),
   #   3+ theological terms (+0.05), has scripture refs (+0.05)

6. DUPLICATE CHECK (two-level)
   IF (scripture_ref, primary_source) in session_seen_set:
       SKIP  # In-memory dedup (fast)
   IF db.count(scripture_refs=refs, primary_source=source) > 0:
       SKIP  # DB dedup (authoritative)

7. INSERT INTO unified_rag_content
   db_writer.write(item)
   # content_category derived from content_type via mapping dict
   # Sets created_at and updated_at to now()
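
The two-level duplicate check in step 6 can be sketched as follows. This is a minimal illustration, not the scraper's actual code: `DuplicateChecker` and the `db_count` callable are hypothetical names, and the choice to remember every checked key in the session set is an assumption.

```python
from typing import Callable, Set, Tuple


class DuplicateChecker:
    """Two-level duplicate check: in-memory set first, database second."""

    def __init__(self, db_count: Callable[[str, str], int]) -> None:
        # db_count is a hypothetical callable that returns how many rows
        # already exist for a (scripture_ref, primary_source) pair.
        self._db_count = db_count
        self._session_seen: Set[Tuple[str, str]] = set()

    def is_duplicate(self, scripture_ref: str, primary_source: str) -> bool:
        key = (scripture_ref, primary_source)

        # Level 1: fast in-memory check for keys seen earlier in this run.
        if key in self._session_seen:
            return True

        # Assumption: every checked key is remembered, so a repeat within
        # the same run never reaches the database.
        self._session_seen.add(key)

        # Level 2: authoritative check against existing rows.
        return self._db_count(scripture_ref, primary_source) > 0
```

Checking the set first keeps repeated references within one volume from costing a database round trip each time, while the database check still catches rows inserted by earlier runs.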

Path B: AI Generation (Node.js scripts)

A six-phase pipeline that generates new illustrations. The text-generation phases call the claude -p CLI via generateWithClaudeMax() from scripts/lib/shared.mjs; Phase 6 generates images with DALL-E.

Phase 1: REGENERATE STUBS (regenerate-stubs.mjs)
   Read stubs from unified_rag_content WHERE word_count <= 30
   FOR each stub:
       prompt = existing metadata (topics, themes, scripture) as context
       new_content = claude -p "Generate illustration..."
       UPDATE unified_rag_content SET content, summary, teaser,
           word_count, quality_score, embedding, embedding_model

Phase 2: GENERATE BY SCRIPTURE (generate-by-scripture.mjs)
   Target: popular passages, lectionary readings, book gaps
   Generate new illustrations for underserved scripture references

Phase 3: GENERATE BY TOPIC (generate-by-topic.mjs)
   Target: underserved topics x source categories
   Fill coverage gaps across topic taxonomy

Phase 4: GENERATE LENS CONTENT (generate-lens-content.mjs)
   Target: tradition-specific illustrations for each of 17 lenses
   Each illustration tagged with specific theological_lens_id

Phase 6: GENERATE IMAGES (generate-illustration-images.mjs)
   DALL-E image generation per illustration
   STRICT RULES (from content-rules.md):
   - NEVER depict God, Jesus's face, or any deity
   - Jesus only from behind, silhouette, or at distance
   - No non-Christian religious symbols or architecture
   - No nudity, no meditation poses, no text in images
   - Always include AI disclosure in alt text

Embedding Generation

Both paths generate embeddings using the same model and format.

MODEL: text-embedding-3-small (OpenAI API)
DIMENSIONS: 1536
TEXT FORMAT: "Scripture: {ref}\n\nAuthor: {author}\n\nSource: {source}\n\nContent: {content}"
COLUMN: embedding (vector(1536))
TRACKING: embedding_model column per row

NOTE: Embeddings still use OpenAI API (no CLI alternative).
      Use --skip-embeddings flag to defer embedding generation.

CRITICAL: If the embedding model ever changes, ALL embeddings must be
regenerated together. Mixed embedding spaces break vector search.
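
The text format and the single-model rule above can be expressed as two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, and the actual OpenAI API call is left out so the example stays runnable without credentials.

```python
from typing import Iterable, Optional, Set

EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 1536


def build_embedding_text(ref: str, author: str, source: str, content: str) -> str:
    """Assemble the text that gets embedded, in the documented field order."""
    return (
        f"Scripture: {ref}\n\n"
        f"Author: {author}\n\n"
        f"Source: {source}\n\n"
        f"Content: {content}"
    )


def find_foreign_models(embedding_models: Iterable[Optional[str]]) -> Set[str]:
    """Return any embedding_model values that differ from the current model.

    A non-empty result means the table mixes embedding spaces and needs a
    full regeneration. None is skipped on the assumption that it marks rows
    whose embedding was deferred (for example via --skip-embeddings).
    """
    return {m for m in embedding_models if m is not None and m != EMBEDDING_MODEL}
```

Running `find_foreign_models` over the distinct values of the embedding_model column before a search deployment gives a cheap guard against a partially migrated table.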

View Read Layer

After content is inserted, it becomes visible in the directory immediately through a live SQL view.

SOURCE TABLE: public.unified_rag_content (327K+ rows)
                    |
                    v  (live — no refresh needed)
VIEW: dir_illustrations (regular SQL view, ~50K rows)
   - Filters: content_category = 'illustration', is_active = true
   - Includes 26 content types
   - Includes structured data fields

NOTE: dir_illustrations is NOT a materialized view. Content appears immediately.
      If rows are missing, check unified_rag_content.is_active and curation_status.

Content Quality Rules (content-rules.md)

WORD COUNTS:
   Standard illustrations: 180-280 words
   Never under 100 words (stubs)
   Commentary (churchwiseai_commentary): 300-500 words

BANNED PHRASES (quality_score penalty):
   "Consider how [scripture] speaks to [topic]"
   "In a world where..."
   "A story that demonstrates..."
   "This modern example reminds us..."
   Any [template brackets] leftover

GOD NAMES: Always capitalized (Yahweh, Jehovah, Elohim, Adonai, El Shaddai)
FOREIGN WORDS: Wrapped in *asterisks* with English meaning
TITLES: Wrapped in *asterisks* (books, movies, songs)
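
The scoring rules from Path A step 5 combine with the banned phrases and God-name rule above roughly as follows. This is a hedged sketch, not the real simple_quality_score: the proper-noun heuristic, the theological term list, the once-only banned-phrase penalty, and the clamping and rounding are all assumptions made for illustration.

```python
import re
from typing import List

BASELINE = 0.70

# Banned phrases from the list above, plus leftover [template brackets].
BANNED_PATTERNS = [
    re.compile(r"consider how .+? speaks to", re.IGNORECASE),
    re.compile(r"in a world where", re.IGNORECASE),
    re.compile(r"a story that demonstrates", re.IGNORECASE),
    re.compile(r"this modern example reminds us", re.IGNORECASE),
    re.compile(r"\[[^\]]+\]"),
]

GOD_NAMES = ["Yahweh", "Jehovah", "Elohim", "Adonai", "El Shaddai"]

# Assumption: illustrative list only; the real term list is not documented.
THEOLOGICAL_TERMS = {"grace", "atonement", "covenant", "redemption",
                     "sanctification", "justification"}


def has_miscased_god_name(text: str) -> bool:
    """True if a God name appears in any casing other than the listed form."""
    for name in GOD_NAMES:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, re.IGNORECASE):
            if match.group(0) != name:
                return True
    return False


def count_proper_nouns(text: str) -> int:
    """Assumed heuristic: distinct capitalized words that follow a lowercase
    word or a comma/semicolon, which excludes sentence-initial words."""
    return len(set(re.findall(r"(?<=[a-z,;] )[A-Z][a-z]+", text)))


def count_theological_terms(text: str) -> int:
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return len(words & THEOLOGICAL_TERMS)


def simple_quality_score(content: str, refs: List[str]) -> float:
    word_count = len(content.split())
    score = BASELINE

    # Penalties
    if word_count < 80:
        score -= 0.25
    if any(p.search(content) for p in BANNED_PATTERNS):
        score -= 0.15  # assumption: applied once, not per phrase
    if has_miscased_god_name(content):
        score -= 0.10

    # Bonuses
    if 150 <= word_count <= 280:
        score += 0.05
    if count_proper_nouns(content) >= 5:
        score += 0.05
    if count_theological_terms(content) >= 3:
        score += 0.05
    if refs:
        score += 0.05

    # Assumption: clamp to [0, 1] and round away float noise.
    return round(max(0.0, min(1.0, score)), 2)
```

Under these rules a clean illustration in the bonus word range with at least one scripture reference scores 0.80 before the proper-noun and theological-term bonuses are considered.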

VISIBILITY TIERS (assigned by assign-visibility-tiers.mjs):
   public (~18%): historical_illustration + top ~50 per non-premium type
   free_signup (~68%): majority of library, requires free account
   premium (~14%): premium content types, requires $9.95/mo subscription

DATABASE RULES:
   Always update embedding, summary, teaser, word_count, quality_score
     when changing content
   Never delete rows
   Never change source_type, content_type, content_category, content_format
   Use service role key (bypasses RLS)
   Run in small batches with rate limiting
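
A minimal sketch of how a maintenance script might enforce the database rules above before writing. All names here are hypothetical, and the batch size and pause length are placeholder assumptions, since the document does not specify values.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List

# Fields that must accompany any change to content.
REQUIRED_WITH_CONTENT = {"embedding", "summary", "teaser",
                         "word_count", "quality_score"}
# Fields that must never be changed.
IMMUTABLE_FIELDS = {"source_type", "content_type",
                    "content_category", "content_format"}


def validate_update(fields: Dict[str, object]) -> None:
    """Raise ValueError if an update payload breaks the rules above."""
    forbidden = IMMUTABLE_FIELDS & fields.keys()
    if forbidden:
        raise ValueError(f"immutable fields in update: {sorted(forbidden)}")
    if "content" in fields:
        missing = REQUIRED_WITH_CONTENT - fields.keys()
        if missing:
            raise ValueError(f"content change missing: {sorted(missing)}")


def chunked(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` items."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_in_batches(
    updates: Iterable[dict],
    apply_batch: Callable[[List[dict]], None],
    batch_size: int = 25,      # assumption: placeholder value
    pause_s: float = 1.0,      # assumption: placeholder value
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Validate and apply updates batch by batch, pausing between batches.

    Returns the number of batches applied.
    """
    applied = 0
    for batch in chunked(updates, batch_size):
        for fields in batch:
            validate_update(fields)
        apply_batch(batch)
        sleep(pause_s)  # simple fixed-delay rate limiting
        applied += 1
    return applied
```

Passing `sleep` as a parameter keeps the pause easy to disable in tests; `apply_batch` stands in for whatever database client call the script actually uses.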

Key Constraint

ALWAYS use claude -p (Claude CLI) for content generation, NEVER use Anthropic/OpenAI APIs. The founder pays $200/mo for Claude Max. API calls should only be used for real-time product features (chatbot, voice agent) and for embedding generation (no CLI alternative).
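
The nested CLI call described in Path A step 3 might look like the sketch below. It assumes the `claude` binary is on PATH; the function names and the timeout value are illustrative, not taken from the scraper source.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional

# Variables that must be removed for nested CLI invocation from scripts.
STRIPPED_ENV_VARS = ("CLAUDE_CODE_ENTRYPOINT", "CLAUDECODE")


def build_cli_env(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and drop the variables that block nested calls."""
    env = dict(os.environ if base_env is None else base_env)
    for name in STRIPPED_ENV_VARS:
        env.pop(name, None)
    return env


def call_claude_cli(prompt: str, timeout_s: int = 300) -> str:
    """Run `claude -p <prompt>` and return its stdout.

    The timeout is an assumed placeholder. check=True raises
    CalledProcessError if the CLI exits with a non-zero status.
    """
    result = subprocess.run(
        ["claude", "-p", prompt],
        env=build_cli_env(),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout.strip()
```

Passing the prompt as a list element rather than through a shell string avoids quoting problems when the OCR text contains quotes or other shell metacharacters.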
