LLM Bible Text Benchmarking

Measure how faithfully LLMs reproduce scripture.

Bible Bench evaluates large language models on their ability to accurately reproduce canonical scripture. Every verse is scored for fidelity using word-level diff analysis — giving you precise, reproducible benchmarks across models, campaigns, and the full 66-book biblical canon.
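As a rough illustration of what word-level diff scoring can look like, here is a minimal Python sketch. It is not Bible Bench's actual scoring code: the function names, the tokenization rule (lowercase, punctuation ignored), and the choice of difflib's similarity ratio as the score are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def tokenize(text: str) -> list[str]:
    # Assumption: case and punctuation do not count against fidelity.
    return re.findall(r"[a-z']+", text.lower())


def fidelity_score(canonical: str, candidate: str) -> float:
    """Return a 0.0-1.0 word-level similarity between two verse texts."""
    ref = tokenize(canonical)
    hyp = tokenize(candidate)
    if not ref and not hyp:
        return 1.0  # two empty texts are trivially identical
    # ratio() is 2 * matched_words / (len(ref) + len(hyp)).
    return SequenceMatcher(None, ref, hyp, autojunk=False).ratio()


canonical = "In the beginning God created the heaven and the earth."
candidate = "In the beginning God created the heavens and the earth."
print(fidelity_score(canonical, candidate))
```

In this example, one changed word out of ten yields a score of 0.9. A production scorer might weight substitutions, omissions, and insertions differently.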


Verses Evaluated

31,102

Books Covered

LLM Models

3 of 9 so far

Platform Features

Everything you need to benchmark LLM scripture fidelity

From evaluation campaign management to verse-level drill-down, Bible Bench gives researchers and developers a complete toolkit for understanding how language models handle the Word of God as rendered in the King James Version.

Dashboard

Command center for scripture fidelity intelligence — live metrics, model leaderboards, and campaign health at a glance.


Campaigns

Scope and organize evaluation runs by model, Bible version, and book set. Group results for meaningful cross-model comparisons.


Explorer

Drill down from Bible to book to chapter to verse. Compare canonical and LLM text side by side, with fidelity scores and diff breakdowns.
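A diff breakdown of the kind described above could be produced along these lines. This is a hypothetical Python sketch: the function name `diff_breakdown` and the tuple layout are assumptions made for illustration, not the Explorer's actual output format.

```python
from difflib import SequenceMatcher


def diff_breakdown(canonical: str, candidate: str) -> list[tuple[str, str, str]]:
    """List (operation, canonical_words, llm_words) for every non-matching span."""
    ref = canonical.split()
    hyp = candidate.split()
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is 'replace', 'delete', or 'insert'
            changes.append((tag, " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return changes


for change in diff_breakdown("the heaven and the earth", "the heavens and the earth"):
    print(change)
```

Each tuple names the operation and shows the canonical words next to the words the model produced, which is enough to render a side-by-side view.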


Models


Runs

Monitor evaluation runs in real time. Inspect structural retry traces and recovery outcomes at chapter and verse level.


Evaluation Reports

Deep analytical reports for every evaluation dimension

Seven purpose-built reports surface the insights that matter — from scoring integrity and cross-model comparisons to coverage gaps and structural recovery analysis.

Early Access

Ready to benchmark your models against scripture?

Bible Bench is currently in early access. Join the waitlist to get notified when evaluation capacity opens for your team.



The seven evaluation reports: Audit, Database Explorer, Aggregation Analysis, Worst by Model, Model Summary, Coverage Status, and Retry Traces.
