Calibration

How calibrated is the Rigor Review, really?

An Αlpha¹ Rigor Review is only worth trusting if its verdicts agree with careful human reviewers. We measure that against a held-out corpus and publish the numbers here — refreshed on every engine release. Calibration is published, not claimed.

Latest results

Sound papers, never failed

24/24

zero false failures on gold-standard + control papers

Known-problem papers flagged

9/24

precision-first — flags only what it can substantiate

Engine v7.4-baseline

Papers 48

Run 2026-07-04

Outcome calibration on a balanced corpus of papers with known outcomes (retracted vs. sound), majority-voted across 3 repeats. The engine is tuned for precision: when it flags a paper it can point to a demonstrable failure, and it deliberately does not fail a paper on suspicion alone — so some problem papers whose issues aren't evidenced in the text are not caught.
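As a reading aid, the two headline counts can be turned into standard rates. This is a minimal sketch, not the harness itself: it assumes that "flagged" means a fail verdict on a known-problem paper and that "never failed" means no fail verdict on a sound paper. The function and variable names are hypothetical.

```python
from fractions import Fraction

# Counts copied from the "Latest results" panel above.
SOUND_TOTAL = 24      # gold-standard + control papers
SOUND_FAILED = 0      # sound papers the engine failed (false failures)
PROBLEM_TOTAL = 24    # known-problem papers
PROBLEM_FLAGGED = 9   # known-problem papers the engine flagged


def rates(sound_total, sound_failed, problem_total, problem_flagged):
    """Return (specificity, recall, precision) as exact fractions.

    Assumption: a "flag" is treated as a positive prediction and a
    known-problem paper as a positive gold label.
    """
    specificity = Fraction(sound_total - sound_failed, sound_total)
    recall = Fraction(problem_flagged, problem_total)
    flagged_total = problem_flagged + sound_failed
    # Precision is undefined when nothing was flagged at all.
    precision = Fraction(problem_flagged, flagged_total) if flagged_total else None
    return specificity, recall, precision


specificity, recall, precision = rates(
    SOUND_TOTAL, SOUND_FAILED, PROBLEM_TOTAL, PROBLEM_FLAGGED
)
print(f"specificity = {float(specificity):.3f}")  # 1.000
print(f"recall      = {float(recall):.3f}")       # 0.375
print(f"precision   = {float(precision):.3f}")    # 1.000
```

Under those assumptions the numbers show the precision-first trade-off directly: every flag in this run landed on a known-problem paper, while roughly five in eight known-problem papers were not flagged.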

How we measure

The harness runs the live engine over a corpus of papers and compares each synthesized per-dimension verdict (pass / warn / fail / not applicable) to a gold label assigned by a human reviewer or derived from a known outcome — for example, a retracted paper or a preprint with public expert reviews.

We report two things: per-dimension agreement with the gold labels, and the distribution of inter-agent agreement among the independent scoring models, which serves as a proxy for confidence. The corpus is balanced across known-good, known-bad, and review-matched papers, and we publish its size so the number can be read in context.
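The per-dimension agreement described above could be computed along these lines. This is an illustrative sketch only: the record layout, the function name, and the dimension names ("statistics", "methods") are assumptions, and the sample records are toy inputs, not corpus data.

```python
from collections import defaultdict

# The four verdict values named in the text.
VERDICTS = {"pass", "warn", "fail", "not applicable"}


def per_dimension_agreement(records):
    """Compute the share of exact matches between engine and gold labels.

    records: iterable of (dimension, engine_verdict, gold_label) tuples.
    Returns {dimension: fraction of records where engine == gold}.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for dimension, engine, gold in records:
        if engine not in VERDICTS or gold not in VERDICTS:
            raise ValueError(f"unknown verdict: {engine!r} / {gold!r}")
        totals[dimension] += 1
        matches[dimension] += engine == gold  # True counts as 1
    return {d: matches[d] / totals[d] for d in totals}


# Toy input with hypothetical dimension names.
records = [
    ("statistics", "fail", "fail"),
    ("statistics", "pass", "warn"),
    ("methods", "pass", "pass"),
    ("methods", "not applicable", "not applicable"),
]
agreement = per_dimension_agreement(records)
print(agreement)  # {'statistics': 0.5, 'methods': 1.0}
```

Exact-match agreement treats a pass/warn disagreement the same as a pass/fail one. A weighted measure would distinguish them, but the text does not say which is used.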

Review panel (7.7.0)

Role	What it does
Reviewer 1	Independent reviewer — scores all eight dimensions
Reviewer 2	Independent reviewer — scores all eight dimensions
Reviewer 3	Independent reviewer; scores all eight dimensions. The per-dimension verdict is the majority vote of the three reviewers
Pre-Submission Reviewer	Writes the summary + prioritized actions (prose only — majority rule decides the verdicts)
Citation integrity	Extracts references; each is resolved against Crossref/OpenAlex + Retraction Watch

The three scoring reviewers are deliberately drawn from different model families, so their agreement is a real signal rather than an echo. The panel is pinned per engine version; the specific models are an operational detail we don’t publish.
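A majority vote over three reviewers, together with the agreement level that serves as a confidence proxy, might look like the sketch below. The function name and the sample votes are hypothetical. The handling of a three-way split (returning no verdict) is an assumption, since the page does not say how such splits are resolved.

```python
from collections import Counter


def majority_verdict(votes):
    """Return (verdict, agreement) for one dimension of one paper.

    votes: the three reviewers' verdicts for that dimension.
    agreement: how many reviewers backed the most common verdict (1 to 3).
    Assumption: a three-way split yields verdict None (no majority).
    """
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    verdict, count = Counter(votes).most_common(1)[0]
    if count == 1:  # every reviewer disagreed, so there is no majority
        return None, 1
    return verdict, count


# Toy votes for three dimensions of a single paper.
panel_votes = [
    ("pass", "pass", "fail"),
    ("fail", "fail", "fail"),
    ("pass", "warn", "fail"),
]
results = [majority_verdict(v) for v in panel_votes]
print(results)  # [('pass', 2), ('fail', 3), (None, 1)]

# Distribution of agreement levels across dimensions.
distribution = Counter(agreement for _, agreement in results)
print(dict(distribution))  # {2: 1, 3: 1, 1: 1}
```

Unanimous verdicts (agreement 3) can be read as higher confidence than 2-to-1 splits, which is the sense in which agreement across different model families is a signal.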

Limitations we disclose

The engine reasons over text; arithmetic it cannot verify is labeled suspected, not asserted.
Full manuscript text is required — abstract-only inputs are rejected, not guessed.
Significance (importance/novelty) is not assessed by the AI; it is set only when an expert co-signs.
Gold labels are themselves human judgments; we publish corpus size and provenance so the concordance number can be read in context.