Content Generation Pipeline
How sermon illustrations are generated, enriched, embedded, and surfaced in the IllustrateTheWord directory (327K+ records in unified_rag_content).
Two Content Paths
Content enters unified_rag_content through two paths: public-domain scraping and AI generation. Both converge on the same insertion and embedding pipeline.
Path A: Public-Domain Scraping (Python)
Historical illustrations are scraped from Archive.org sources (Biblical Illustrator, Spurgeon, Maclaren). Text cleanup uses the claude -p CLI, NOT API calls.
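Step 3 of this path calls claude -p from inside a script and removes two environment variables first. A minimal sketch of that nested call, assuming the claude binary is on PATH and accepts the prompt as the argument to -p. The helper names build_child_env and call_claude_cli and the timeout value are hypothetical, not taken from the scraper's actual code.

```python
import os
import subprocess

# Variables that step 3 says must be removed before a nested `claude -p` call.
STRIPPED_ENV_VARS = ("CLAUDE_CODE_ENTRYPOINT", "CLAUDECODE")


def build_child_env(parent_env=None):
    """Return a copy of the environment without the Claude Code markers."""
    env = dict(os.environ if parent_env is None else parent_env)
    for name in STRIPPED_ENV_VARS:
        env.pop(name, None)  # ignore variables that are not set
    return env


def call_claude_cli(prompt, timeout_s=300):
    """Run `claude -p <prompt>` and return its stdout (hypothetical helper)."""
    result = subprocess.run(
        ["claude", "-p", prompt],
        env=build_child_env(),
        capture_output=True,
        text=True,
        timeout=timeout_s,  # assumed limit; not specified in this document
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Keeping the environment filtering in its own function makes it possible to check the stripping logic without spawning the CLI.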
1. SCRAPER SELECTS SOURCE
scraper = BiblicalIllustratorScraper(source_config)
# source_config defines: author, pub_year, content_type, theological_lens_id
2. PARSE VOLUME
for item in scraper.parse_volume(volume_id):
# item is a ScrapedItem: raw_text, book, chapter, verse, verse_quote
yield item
3. AI CLEANUP (Claude CLI — NOT API)
cleaned = scraper._call_claude_cli(prompt)
# Spawns: claude -p "Clean up this OCR text..."
# ENV VARS STRIPPED: CLAUDE_CODE_ENTRYPOINT, CLAUDECODE deleted
# (required for nested CLI invocation from scripts)
# COST: $0 — founder pays $200/mo for Claude Max
4. BUILD ProcessedItem
item_id = uuid4()  # generate the id first so the slug can reuse it
item = ProcessedItem(
id=item_id,
slug=slugify(title) + "-" + str(item_id)[:8],
title, content, summary, teaser,
word_count=len(content.split()),
content_type="historical_illustration",
source_type="ai_generated",
scripture_references=[make_scripture_ref(book, ch, v_start, v_end)],
theological_lens_id=0, # Universal (shows in all traditions)
is_universal=True,
topics, themes,
primary_author, primary_source, source_attribution,
quality_score=simple_quality_score(content, refs),
visibility_tier="free_signup",
curation_status="approved",
)
5. QUALITY SCORING
score = simple_quality_score(content, refs)
# Baseline: 0.70
# Penalties: word_count < 80 (-0.25), banned phrases (-0.15),
# God names lowercase (-0.10)
# Bonuses: 150-280 words (+0.05), 5+ proper nouns (+0.05),
# 3+ theological terms (+0.05), has scripture refs (+0.05)
6. DUPLICATE CHECK (two-level)
IF (scripture_ref, primary_source) in session_seen_set:
SKIP # In-memory dedup (fast)
IF db.count(scripture_refs=refs, primary_source=source) > 0:
SKIP # DB dedup (authoritative)
7. INSERT INTO unified_rag_content
db_writer.write(item)
# content_category derived from content_type via mapping dict
# Sets created_at and updated_at to now()
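The scoring rules in step 5 can be sketched as follows. Only the baseline, penalty, and bonus values come from this document. The regexes, the theological term list, the proper-noun heuristic, and the final clamping to the range 0 to 1 are assumptions made for illustration, and the real simple_quality_score may differ.

```python
import re

BASELINE = 0.70

# Assumed regexes for the banned phrases listed in content-rules.md.
BANNED_PATTERNS = [
    re.compile(r"consider how .+ speaks to", re.IGNORECASE),
    re.compile(r"in a world where", re.IGNORECASE),
    re.compile(r"a story that demonstrates", re.IGNORECASE),
    re.compile(r"this modern example reminds us", re.IGNORECASE),
    re.compile(r"\[[^\]]+\]"),  # leftover [template brackets]
]
GOD_NAMES = ("Yahweh", "Jehovah", "Elohim", "Adonai", "El Shaddai")
# Assumed sample vocabulary; the real term list is not given here.
THEOLOGICAL_TERMS = ("grace", "atonement", "covenant", "redemption", "sanctification")


def count_proper_nouns(text):
    """Crude heuristic: capitalized words that do not start a sentence."""
    count = 0
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = sentence.split()
        count += sum(1 for token in tokens[1:] if token[:1].isupper())
    return count


def simple_quality_score(content, refs):
    word_count = len(content.split())
    lowered = content.lower()
    score = BASELINE

    # Penalties
    if word_count < 80:
        score -= 0.25
    if any(pattern.search(content) for pattern in BANNED_PATTERNS):
        score -= 0.15
    # Case-sensitive search, so only all-lowercase God names are flagged.
    if any(re.search(rf"\b{name.lower()}\b", content) for name in GOD_NAMES):
        score -= 0.10

    # Bonuses
    if 150 <= word_count <= 280:
        score += 0.05
    if count_proper_nouns(content) >= 5:
        score += 0.05
    if sum(term in lowered for term in THEOLOGICAL_TERMS) >= 3:
        score += 0.05
    if refs:
        score += 0.05

    # Clamping and rounding are assumptions, not stated in this document.
    return round(max(0.0, min(1.0, score)), 2)
```

Each penalty and bonus is applied at most once per item, which matches the way the rules are listed but is itself an interpretation.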
Path B: AI Generation (Node.js scripts)
A six-phase pipeline that generates new illustrations (Phase 5 is not documented here). The text phases use the claude -p CLI via generateWithClaudeMax() from scripts/lib/shared.mjs; Phase 6 uses DALL-E for images.
Phase 1: REGENERATE STUBS (regenerate-stubs.mjs)
Read stubs from unified_rag_content WHERE word_count <= 30
FOR each stub:
prompt = existing metadata (topics, themes, scripture) as context
new_content = claude -p "Generate illustration..."
UPDATE unified_rag_content SET content, summary, teaser,
word_count, quality_score, embedding, embedding_model
Phase 2: GENERATE BY SCRIPTURE (generate-by-scripture.mjs)
Target: popular passages, lectionary readings, book gaps
Generate new illustrations for underserved scripture references
Phase 3: GENERATE BY TOPIC (generate-by-topic.mjs)
Target: underserved topics x source categories
Fill coverage gaps across topic taxonomy
Phase 4: GENERATE LENS CONTENT (generate-lens-content.mjs)
Target: tradition-specific illustrations for each of 17 lenses
Each illustration tagged with specific theological_lens_id
Phase 6: GENERATE IMAGES (generate-illustration-images.mjs)
DALL-E image generation per illustration
STRICT RULES (from content-rules.md):
- NEVER depict God, Jesus's face, or any deity
- Jesus only from behind, silhouette, or at distance
- No non-Christian religious symbols or architecture
- No nudity, no meditation poses, no text in images
- Always include AI disclosure in alt text
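The stub selection and prompt assembly in Phase 1 can be sketched as below. The real scripts are Node.js; Python is used here only for illustration. The row field names follow the columns mentioned in this document, while build_stub_prompt and its prompt wording are hypothetical.

```python
STUB_MAX_WORDS = 30  # Phase 1 selects rows WHERE word_count <= 30


def is_stub(row):
    """True when a row is short enough to be regenerated."""
    return row["word_count"] <= STUB_MAX_WORDS


def build_stub_prompt(row):
    """Assemble a prompt from existing metadata (wording is hypothetical)."""
    lines = [
        "Generate an illustration using the metadata below as context.",
        f"Topics: {', '.join(row['topics'])}",
        f"Themes: {', '.join(row['themes'])}",
        f"Scripture: {', '.join(row['scripture_references'])}",
    ]
    return "\n".join(lines)
```

The resulting prompt string would then be passed to the CLI wrapper, and the returned text written back with the fields listed in the Phase 1 UPDATE.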
Embedding Generation
Both paths generate embeddings using the same model and format.
MODEL: text-embedding-3-small (OpenAI API)
DIMENSIONS: 1536
TEXT FORMAT: "Scripture: {ref}\n\nAuthor: {author}\n\nSource: {source}\n\nContent: {content}"
COLUMN: embedding (vector(1536))
TRACKING: embedding_model column per row
NOTE: Embeddings still use OpenAI API (no CLI alternative).
Use --skip-embeddings flag to defer embedding generation.
CRITICAL: If embedding model ever changes, ALL embeddings must be
regenerated together. Mixed embedding spaces break vector search.
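The text format and the single-model rule above can be sketched as below. The constants come from this section; the function names and the shape of the row dictionaries are assumptions. The OpenAI API call itself is left out so the sketch runs offline.

```python
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 1536


def build_embedding_text(ref, author, source, content):
    """Format a row the way the TEXT FORMAT line describes."""
    return (
        f"Scripture: {ref}\n\n"
        f"Author: {author}\n\n"
        f"Source: {source}\n\n"
        f"Content: {content}"
    )


def is_valid_embedding(vector):
    """Check that a vector fits the vector(1536) column."""
    return len(vector) == EMBEDDING_DIMENSIONS


def find_unexpected_models(rows):
    """Return embedding_model values that differ from the expected model.

    A non-empty result means the table mixes embedding spaces.
    """
    return {row["embedding_model"] for row in rows} - {EMBEDDING_MODEL}
```

Running find_unexpected_models over the embedding_model column before and after a migration is one way to confirm that no rows were left on an older model.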
View Read Layer
After content is inserted, it becomes visible in the directory immediately through a live SQL view.
SOURCE TABLE: public.unified_rag_content (327K+ rows)
|
v (live — no refresh needed)
VIEW: dir_illustrations (regular SQL view, ~50K rows)
- Filters: content_category = 'illustration', is_active = true
- Includes 26 content types
- Includes structured data fields
NOTE: dir_illustrations is NOT a materialized view. Content appears immediately.
If rows are missing, check unified_rag_content.is_active and curation_status.
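The behavior of a regular (non-materialized) view can be demonstrated with a small stand-in. This sketch uses an in-memory SQLite database instead of the production Postgres database, a reduced column set, and a guessed view definition based only on the two filters listed above. The real view may apply further conditions such as curation_status.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE unified_rag_content (
        id INTEGER PRIMARY KEY,
        title TEXT,
        content_category TEXT,
        is_active INTEGER  -- SQLite stores booleans as 0/1
    );
    CREATE VIEW dir_illustrations AS
        SELECT id, title
        FROM unified_rag_content
        WHERE content_category = 'illustration' AND is_active = 1;
    """
)


def visible_titles():
    """Read the view; it is re-evaluated on every query."""
    rows = conn.execute("SELECT title FROM dir_illustrations ORDER BY id").fetchall()
    return [title for (title,) in rows]


before = visible_titles()
conn.execute("INSERT INTO unified_rag_content VALUES (1, 'Active row', 'illustration', 1)")
conn.execute("INSERT INTO unified_rag_content VALUES (2, 'Inactive row', 'illustration', 0)")
after = visible_titles()
```

Here before is empty and after contains only the active row, with no refresh step in between. The inactive row stays hidden, which is the same symptom the note above describes for missing rows.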
Content Quality Rules (content-rules.md)
WORD COUNTS:
Standard illustrations: 180-280 words
Never under 100 words (stubs)
Commentary (churchwiseai_commentary): 300-500 words
BANNED PHRASES (quality_score penalty):
"Consider how [scripture] speaks to [topic]"
"In a world where..."
"A story that demonstrates..."
"This modern example reminds us..."
Any [template brackets] leftover
GOD NAMES: Always capitalized (Yahweh, Jehovah, Elohim, Adonai, El Shaddai)
FOREIGN WORDS: Wrapped in *asterisks* with English meaning
TITLES: Wrapped in *asterisks* (books, movies, songs)
VISIBILITY TIERS (assigned by assign-visibility-tiers.mjs):
public (~18%): historical_illustration + top ~50 per non-premium type
free_signup (~68%): majority of library, requires free account
premium (~14%): premium content types, requires $9.95/mo subscription
DATABASE RULES:
Always update embedding, summary, teaser, word_count, quality_score
when changing content
Never delete rows
Never change source_type, content_type, content_category, content_format
Use service role key (bypasses RLS)
Run in small batches with rate limiting
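The "small batches with rate limiting" rule can be sketched with a generic helper. The batch size, the pause length, and the function names are assumptions; the handler stands in for whatever performs the actual database update.

```python
import time
from itertools import islice


def batched(items, size):
    """Yield lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def run_in_batches(items, handler, batch_size=25, pause_s=1.0, sleep=time.sleep):
    """Call handler(batch) for each batch, pausing between batches.

    `sleep` is injectable so the pacing can be tested without waiting.
    Returns the number of items processed.
    """
    processed = 0
    for index, batch in enumerate(batched(items, batch_size)):
        if index > 0:
            sleep(pause_s)  # rate limit: pause before every batch after the first
        handler(batch)
        processed += len(batch)
    return processed
```

A fixed pause is the simplest form of rate limiting. A script that hits API quotas might instead back off on error responses.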
Key Constraint
ALWAYS use claude -p (Claude CLI) for content generation; NEVER use the Anthropic or OpenAI APIs for it. The founder pays $200/mo for Claude Max, so CLI generation adds no per-call cost. API calls are reserved for real-time product features (chatbot, voice agent) and for embedding generation, which has no CLI alternative.