How rigorous is the AI, really?
An Αlpha¹ Assessment is only worth trusting if its verdicts agree with careful human reviewers. We measure that against a held-out corpus and publish the numbers here — refreshed on every engine release. Calibration is published, not claimed.
Latest results
The harness (scripts/rigor-benchmark.ts) and methodology are in place; headline numbers will appear here on the first run.
How we measure
The harness runs the live engine over a corpus of papers and compares each synthesized per-dimension verdict (pass / warn / fail / not applicable) to a gold label assigned by a human reviewer or derived from a known outcome — for example, a retracted paper or a preprint with public expert reviews.
We report two things: per-dimension agreement with the gold labels, and the distribution of inter-agent agreement between the two independent scoring models, which serves as a proxy for confidence. The corpus is balanced across known-good, known-bad, and review-matched papers, and we publish its size so the agreement figures can be read in context.
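As a rough illustration of the two reported quantities, the sketch below computes per-dimension agreement with gold labels and a per-paper inter-agent agreement rate. It is not the code in scripts/rigor-benchmark.ts: the `ScoredDimension` shape, the field names, and the choice to summarize inter-agent agreement per paper are all assumptions made for this example. Only the four verdict values come from the description above.

```typescript
// Verdict values follow the text; the record shape below is hypothetical.
type Verdict = "pass" | "warn" | "fail" | "not_applicable";

interface ScoredDimension {
  paperId: string;
  dimension: string; // hypothetical dimension name, e.g. "statistics"
  engine: Verdict;   // synthesized verdict published by the engine
  gold: Verdict;     // label from a human reviewer or a known outcome
  agent1: Verdict;   // first independent scorer
  agent2: Verdict;   // second independent scorer
}

// Counts matches per grouping key, then converts each tally to a fraction.
function matchRate(
  rows: ScoredDimension[],
  keyOf: (row: ScoredDimension) => string,
  isMatch: (row: ScoredDimension) => boolean
): Map<string, number> {
  const tally = new Map<string, { match: number; total: number }>();
  for (const row of rows) {
    const key = keyOf(row);
    const t = tally.get(key) ?? { match: 0, total: 0 };
    t.total += 1;
    if (isMatch(row)) t.match += 1;
    tally.set(key, t);
  }
  const rates = new Map<string, number>();
  tally.forEach((t, key) => rates.set(key, t.match / t.total));
  return rates;
}

// Fraction of rows, per dimension, where the engine verdict equals the gold label.
function agreementByDimension(rows: ScoredDimension[]): Map<string, number> {
  return matchRate(rows, (r) => r.dimension, (r) => r.engine === r.gold);
}

// Fraction of dimensions, per paper, where the two scoring agents gave the same verdict.
// The spread of these values across the corpus is one way to read a "distribution".
function interAgentAgreement(rows: ScoredDimension[]): Map<string, number> {
  return matchRate(rows, (r) => r.paperId, (r) => r.agent1 === r.agent2);
}
```

Exact-match agreement is the simplest possible measure. A chance-corrected statistic such as Cohen's kappa would be a reasonable alternative when verdicts are heavily skewed toward one class, but the text does not say which measure the harness uses.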
Model stack (2.0.0)
| Role | Model | Purpose |
|---|---|---|
| Agent 1 | tencent/hy3-preview | Independent rigor scorer |
| Agent 2 | moonshotai/kimi-k2-thinking | Independent scorer from a different model family, so agreement is a real signal |
| Synthesizer | google/gemini-3-flash-preview | Reconciles both agents into the published verdict |
| Citation integrity | google/gemini-2.5-flash | Extracts references; each is resolved against Crossref/OpenAlex + Retraction Watch |
The two scoring agents are deliberately different model families so their agreement is a real signal, not an echo. The stack is configurable per environment and pinned per engine version.
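One way to picture a stack that is pinned per engine version is a typed constant plus a guard that fails when both scorers share a family. This is only a sketch. The model identifiers are copied from the table above, while the `ModelStack` shape, the helper names, and the rule that the prefix before "/" identifies the model family are assumptions, not the engine's actual configuration format.

```typescript
// Hypothetical shape for a pinned model stack; only the model IDs come from the page.
interface ModelStack {
  engineVersion: string;
  agent1: string;
  agent2: string;
  synthesizer: string;
  citationIntegrity: string;
}

const STACK_2_0_0: Readonly<ModelStack> = {
  engineVersion: "2.0.0",
  agent1: "tencent/hy3-preview",
  agent2: "moonshotai/kimi-k2-thinking",
  synthesizer: "google/gemini-3-flash-preview",
  citationIntegrity: "google/gemini-2.5-flash",
};

// Assumption: the vendor prefix before "/" stands in for the model family.
function modelFamily(modelId: string): string {
  const slash = modelId.indexOf("/");
  return slash === -1 ? modelId : modelId.slice(0, slash);
}

// Throws when both scoring agents come from the same family, since their
// agreement would then be less informative as an independent signal.
function assertIndependentScorers(stack: ModelStack): void {
  const family1 = modelFamily(stack.agent1);
  const family2 = modelFamily(stack.agent2);
  if (family1 === family2) {
    throw new Error(
      `Scoring agents share the model family "${family1}" in engine ${stack.engineVersion}`
    );
  }
}

assertIndependentScorers(STACK_2_0_0); // passes: "tencent" differs from "moonshotai"
```

Running the guard at startup, or in the benchmark harness, would catch an environment override that accidentally points both agents at the same family.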
Limitations we disclose
- The engine reasons over text; arithmetic it cannot verify is labeled suspected, not asserted.
- Full manuscript text is required — abstract-only inputs are rejected, not guessed.
- Significance (importance/novelty) is not assessed by the AI; it is set only when an expert co-signs.
- Gold labels are themselves human judgments; we publish corpus size and provenance so the concordance number can be read in context.