LLM Bible Text Benchmarking
Measure how faithfully LLMs reproduce scripture.
Bible Bench evaluates large language models on their ability to accurately reproduce canonical scripture. Every verse is scored for fidelity using word-level diff analysis — giving you precise, reproducible benchmarks across models, campaigns, and the full 66-book biblical canon.
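To make the idea of word-level diff scoring concrete, here is a minimal sketch in Python. It is not Bible Bench's actual scoring code: the function names `word_fidelity` and `word_diff`, the whitespace tokenization, and the choice of the standard library's `difflib.SequenceMatcher` ratio as the fidelity score are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def word_fidelity(canonical: str, candidate: str) -> float:
    """Return a 0.0-1.0 similarity ratio computed over whitespace-split words.

    Assumption: fidelity is modeled as SequenceMatcher.ratio(), i.e.
    2 * matching_words / total_words across both texts.
    """
    ref_words = canonical.split()
    out_words = candidate.split()
    # autojunk=False stops very frequent words from being ignored in long texts.
    return SequenceMatcher(None, ref_words, out_words, autojunk=False).ratio()


def word_diff(canonical: str, candidate: str):
    """List non-matching spans as (operation, canonical_words, candidate_words)."""
    ref_words = canonical.split()
    out_words = candidate.split()
    matcher = SequenceMatcher(None, ref_words, out_words, autojunk=False)
    return [
        (tag, ref_words[i1:i2], out_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


canonical = "In the beginning God created the heaven and the earth."
candidate = "In the beginning God created the heavens and the earth."

print(word_fidelity(canonical, candidate))  # 9 of 10 words match on each side
print(word_diff(canonical, candidate))      # one "replace" span: heaven -> heavens
```

A real scorer would also need to decide how to treat punctuation, capitalization, and archaic spellings, none of which this sketch addresses.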
Verses Evaluated
31,102
Books Covered
66
Models Evaluated
3 of 9 so far
Platform Features
Everything you need to benchmark LLM scripture fidelity
From evaluation campaign management to verse-level drill-down, Bible Bench gives researchers and developers a complete toolkit for understanding how language models handle the Word of God as rendered in the King James Version.
Dashboard
Command center for scripture fidelity intelligence — live metrics, model leaderboards, and campaign health at a glance.
Campaigns
Scope and organize evaluation runs by model, bible version, and book set. Group results for meaningful cross-model comparisons.
Explorer
Drill down from bible to book to chapter to verse. Side-by-side canonical vs. LLM text with fidelity scores and diff breakdowns.
Models
Register and manage LLM providers and model versions. Track per-model performance across evaluation campaigns.
Runs
Monitor evaluation runs in real time. Inspect structural retry traces and recovery outcomes at chapter and verse level.
Evaluation Reports
Deep analytical reports for every evaluation dimension
Seven purpose-built reports surface the insights that matter — from scoring integrity and cross-model comparisons to coverage gaps and structural recovery analysis.
Audit
Verify scoring integrity across every evaluation level — verse, chapter, book, and bible — with full trace exception disclosure.
Database Explorer
Forensic field-level inspection of raw and processed LLM responses. Surface diff data and stored metrics for any verse result.
Aggregation Analysis
Cross-level roll-up validation. Verify that chapter, book, and bible scores are mathematically consistent with their verse-level sources.
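As an illustration of what a roll-up consistency check can look like, the sketch below recomputes each chapter score from its verse scores and reports any stored value that disagrees. The aggregation rule (an unweighted mean), the function name `rollup_mismatches`, and the dictionary shapes are assumptions for this example; the rule Bible Bench actually uses is not stated here.

```python
import math
from statistics import fmean


def rollup_mismatches(verse_scores, stored_chapter_scores, tol=1e-6):
    """Compare stored chapter scores with values recomputed from verse scores.

    verse_scores:          {(book, chapter): [verse_score, ...]}
    stored_chapter_scores: {(book, chapter): chapter_score}
    Returns {(book, chapter): (stored, expected)} for every inconsistency;
    expected is None when a chapter has no verse scores at all.
    """
    mismatches = {}
    for key, stored in stored_chapter_scores.items():
        scores = verse_scores.get(key, [])
        if not scores:
            mismatches[key] = (stored, None)
            continue
        expected = fmean(scores)  # assumed rule: unweighted mean of verse scores
        if not math.isclose(stored, expected, abs_tol=tol):
            mismatches[key] = (stored, expected)
    return mismatches


verse_scores = {
    ("Genesis", 1): [1.0, 0.9, 0.8],
    ("Genesis", 2): [1.0, 1.0],
}
stored = {("Genesis", 1): 0.9, ("Genesis", 2): 0.95}

# Only Genesis 2 is flagged: stored 0.95, recomputed 1.0.
print(rollup_mismatches(verse_scores, stored))
```

The same pattern extends upward: book scores can be checked against chapter scores, and the bible score against book scores, under whatever weighting the platform defines.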
Worst by Model
Identify the weakest verses for each evaluated model. Pinpoint where LLM fidelity breaks down at the granular verse level.
Model Summary
Cross-model performance comparison in a single view. Compare all registered LLMs against one another on scripture fidelity.
Coverage Status
Track evaluation coverage across the full biblical canon. Identify gaps: see which books, chapters, or verses are still pending evaluation for each model.
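A coverage check of this kind amounts to a set difference between the canon and the verses already evaluated. The sketch below shows one way to compute it; the function names `pending_verses` and `coverage_by_model`, the data shapes, and the model name `model-a` are hypothetical and not part of Bible Bench.

```python
def pending_verses(canon, evaluated):
    """Return the (book, chapter, verse) references not yet evaluated.

    canon:     {(book, chapter): verse_count}
    evaluated: set of (book, chapter, verse) tuples already scored
    """
    return [
        (book, chapter, verse)
        for (book, chapter), count in canon.items()
        for verse in range(1, count + 1)
        if (book, chapter, verse) not in evaluated
    ]


def coverage_by_model(canon, evaluated_by_model):
    """Summarize covered and pending verses for each model name."""
    total = sum(canon.values())
    report = {}
    for model, done in evaluated_by_model.items():
        missing = pending_verses(canon, done)
        report[model] = {
            "covered": total - len(missing),
            "total": total,
            "pending": missing,
        }
    return report


# Two short psalms stand in for the full canon.
canon = {("Psalms", 117): 2, ("Psalms", 134): 3}
evaluated_by_model = {
    "model-a": {("Psalms", 117, 1), ("Psalms", 117, 2), ("Psalms", 134, 1)},
}

# model-a has covered 3 of 5 verses; Psalms 134:2 and 134:3 are pending.
print(coverage_by_model(canon, evaluated_by_model))
```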
Retry Traces
Structural recovery analysis showing how the three-tier retry system resolved alignment errors at the chapter and verse level.
Early Access
Ready to benchmark your models against scripture?
Bible Bench is currently in early access. Join the waitlist to get notified when evaluation capacity opens for your team.