MusIML @ ICML 2026

The Faithfulness Gap

Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements

Noor Islam S. Mohammad*1  &  Tamim Sheikh2

1 Dept. of Computer Science, Istanbul Technical University, Türkiye
2 Dept. of CS & Engineering, Jashore University of Science and Technology, Bangladesh
* Corresponding author

89.6%
Det. Rate @ 3% FPR
3.2×
Budget Reduction (APBA)
47%
Drift Reduction via FGD
▶ Live Provability Fingerprint Simulation
Agreement Drift witness Oracle timeout
𝒮(F,N) =  → verdict: 
Abstract

Autoformalization is bottlenecked by faithfulness, not fluency.

A formal statement can typecheck, be provable, and still encode a different theorem than the one the mathematician wrote. We introduce Bidirectional Provability Fingerprinting (BPF), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement.

We introduce four complementary innovations — CPG (counterfactual probe generation), the Equivalence Spectrum (continuous faithfulness score), APBA (adaptive budget allocation), and FGD (faithfulness-guided decoding) — and prove a drift detection theorem and a PAC-faithfulness result. We release DriftBench, the first benchmark of 2,183 NL/Lean 4 pairs with controlled drift labels across six subfields of mathlib4.


The Problem

A formalization can be correct-looking and wrong.

All three Lean 4 statements below typecheck and can be proved by a downstream prover. Only (A) faithfully captures the Extreme Value Theorem.

 (A) Faithful
-- Correct: requires Nonempty
theorem evt_A
    {K : Set ℝ}
    (hK : IsCompact K)
    (hne : K.Nonempty)
    {f : ℝ → ℝ}
    (hf : ContinuousOn f K) :
    ∃ x ∈ K, ∀ y ∈ K,
      f y ≤ f x
 (B) Hypothesis Omission
-- ΔH: missing Nonempty
-- false on empty compact set
theorem evt_B
    {K : Set ℝ}
    (hK : IsCompact K)
    -- hne missing!
    {f : ℝ → ℝ}
    (hf : ContinuousOn f K) :
    ∃ x ∈ K, ∀ y ∈ K,
      f y ≤ f x
 (C) Quantifier Inversion
-- Δ∀: swapped ∃x∀y → ∀y∃x
-- trivially true (take x = y)
theorem evt_C
    {K : Set ℝ}
    (hK : IsCompact K)
    (hne : K.Nonempty)
    {f : ℝ → ℝ}
    (hf : ContinuousOn f K) :
    ∀ y ∈ K, ∃ x ∈ K,
      f y ≤ f x
Why existing checks fail: Typechecking accepts all three. A downstream prover closes all three — (C) is trivially true (take x = y), giving no signal that it is unrelated to the Extreme Value Theorem. BLEU against a reference catches neither if the reference is phrased slightly differently. Current pipelines accept all three as valid outputs. A pilot study found 19% of outputs from a state-of-the-art autoformalizer on graduate-level analysis exhibit at least one form of semantic drift.
Drift Taxonomy

Four canonical ways a formalization fails

These four classes account for 94% of human-labeled drift instances in a pilot study of 400 outputs from a state-of-the-art autoformalizer.

Δ∀ — Quantifier Inversion
Swapped quantifier order

Replaces ∀x∃y with ∃y∀x or vice versa. The drifted formalization is often provable — even trivially so — yet semantically disjoint from N. Most pernicious in practice because the statement remains provable.

Witnessability rate: α ≈ 0.62  |  k ≈ 7 probes for δ = 0.01
ΔH — Hypothesis Omission
Missing premise

A stated premise of N is absent from F. The formalization is strictly stronger than intended: it claims a result without a needed assumption. Detected by CPG's hypothesis-drop probes — removing the probe budget for this class costs 9 F1 points.

Witnessability rate: α ≈ 0.55  |  k ≈ 8 probes for δ = 0.01
ΔC — Conclusion Generalisation
Weakened conclusion

F ⇒ N but N ¬⇒ F. The conclusion in the formalization is strictly weaker than intended. Locally plausible — mathematically sound — but globally unsound as a representation of the intended claim.

Witnessability rate: α ≈ 0.48  |  k ≈ 10 probes for δ = 0.01
ΔT — Type Coercion
Silent type retagging

Silent retagging, e.g., ℝ → ℚ or total → partial or Group → Monoid. The statement may remain provable in the new type, but it encodes a different mathematical claim. Hardest to detect syntactically.

Witnessability rate: α ≈ 0.31  |  Lowest; structural test used as complement
Method

Bidirectional Provability Fingerprinting (BPF)

A statement's meaning is determined by what it implies and what implies it. BPF operationalizes this as a probe-based, reference-free certifier: the forward fingerprint Φ+ captures downstream commitments; the backward fingerprint Φ captures upstream conditions. Forward-only fingerprints cannot distinguish a strict strengthening of N from N itself; bidirectional matching is the minimum sufficient signal (Proposition 4.2).

① Input
Natural-Language Statement N
Candidate Formalization F
② Probe Pipeline (CPG)
CPG — Counterfactual Probe Gen.
Probe Label Predictor
Probe Set 𝒫N + labels
Autoformalize Probes
③ Entailment Oracle (BPF, APBA)
APBA Budget Router
Lean 4 + mathlib Entailment Oracle
Φ+F Forward
ΦF Backward
④ Decision (Equivalence Spectrum)
𝒮(F,N) ∈ [0,1]
≥0.93
ACCEPT
0.78–0.93
REVIEW
<0.78
REJECT
CPG

Counterfactual Probe Generation

For each drift class Δ, constructs a mechanically perturbed twin NΔ and generates probes that distinguish N from NΔ. Reaches BPF-naive's k=32 detection rate at k≈12 — halving the oracle budget while improving coverage on three of four drift classes.

Equivalence Spectrum

Continuous Faithfulness Score

𝒮(F,N) ∈ [0,1] replaces brittle binary verdicts. Enables ranking of multiple candidates from an autoformalizer, triage of borderline cases to human inspection, and continuous reward shaping for FGD. Calibrated thresholds: ACCEPT (≥0.93) / REVIEW / REJECT (<0.78).

APBA

Adaptive Probe Budget Allocation

Information-theoretic budget router: at each step selects the next drift class to probe by maximising expected reduction in posterior entropy. Achieves a 3.2× budget reduction over uniform sampling while preserving detection performance — confirmed by Theorem 6.1.

FGD

Faithfulness-Guided Decoding

Uses 𝒮 as a reward during sample-and-rerank decoding. Candidates with 𝒮 < 0.78 trigger a re-prompt with the highest-scoring witness probe as an in-context repair hint, biasing the autoformalizer away from the detected drift class.

Results

BPF + CPG dominates across all operating points

Controlled DriftBench split. BPF + CPG reduces the residual error of the best non-fingerprinting baseline (LLM-judge) by approximately 72% at matched FPR. Adding APBA preserves detection while cutting wall-clock cost 3.2×.

Method F1 F1 Visualization Det @ 3% FPR FPR Cost
Typecheck only 0.27
11.4%2.0%
Provability 0.42
41.2%3.0%18×
BLEU-ref 0.55
30.1%3.0%
Back-translation 0.66
58.9%3.0%
LLM-judge 0.71
63.3%3.0%
BPF-naive 0.85
83.2%3.1%22×
BPF + CPG 0.91
89.6%3.0%26×
BPF + CPG + APBA 0.91
89.4%3.1%

Per-class detection rate (at 3% FPR)

Δ∀ Quantifier Inv.
Typecheck
7%
Provability
46%
LLM-judge
74%
BPF+CPG
94%
ΔH Hyp. Omission
Typecheck
5%
Provability
62%
LLM-judge
67%
BPF+CPG
93%
ΔC Concl. Gen.
Typecheck
4%
Provability
33%
LLM-judge
64%
BPF+CPG
91%
ΔT Type Coercion
Typecheck
31%
Provability
18%
LLM-judge
42%
BPF+CPG
66%

ΔT (type coercion) remains the hardest class, consistent with its lower witnessability rate α(ΔT) ≈ 0.31. On the wild split (384 pairs), BPF+CPG achieves κ = 0.77 vs. inter-annotator κ = 0.81.

DriftBench

The first benchmark with controlled drift labels

2,183
NL / Lean 4 Pairs
6
mathlib4 Subfields
5
Drift Label Classes
κ 0.81
Inter-Annotator (Wild)

Pairs are constructed using deterministic perturbation rules (not LLM sampling) so that drift labels are unambiguous. A complementary wild split contains 384 pairs where the candidate was generated by an LLM autoformalizer and independently annotated by two experts (κ = 0.81).

Analysis
455 pairs   Faith:198  Δ∀:64  ΔH:71  ΔC:58  ΔT:33  Comb.:31
Algebra
396 pairs   Faith:173  Δ∀:51  ΔH:68  ΔC:49  ΔT:28  Comb.:27
Topology
372 pairs   Faith:162  Δ∀:49  ΔH:59  ΔC:47  ΔT:30  Comb.:25
Number Theory
326 pairs   Faith:144  Δ∀:42  ΔH:55  ΔC:41  ΔT:22  Comb.:22
Combinatorics
316 pairs   Faith:138  Δ∀:43  ΔH:51  ΔC:39  ΔT:24  Comb.:21
Category Theory
267 pairs   Faith:117  Δ∀:35  ΔH:44  ΔC:33  ΔT:19  Comb.:19
Faithful Δ∀ ΔH ΔC ΔT Combined
Theory

Formal guarantees for drift detection

Theorem 7.1 — Drift Detection
High-Probability Detection

For any drift class D and candidate F exhibiting drift of class D, if the probe generator samples with witnessability rate α(D) > 0, then with budget k ≥ log(1/δ) / α(D), BPF flags F as drifted with probability at least 1−δ (assuming a complete entailment oracle). Empirical rates: α(Δ∀)≈0.62, α(ΔH)≈0.55, α(ΔC)≈0.48, α(ΔT)≈0.31.

Theorem 7.2 — PAC-Faithfulness
Polynomial Learnability of [N]T

With k ≥ (1/α0) log(1/δϵ) probes, BPF produces a hypothesis that correctly classifies F ∈ [N]T with probability ≥ 1−ϵ over the autoformalizer's output distribution, with confidence 1−δ. Budget scales as O(log(1/δ) / ϵ). The proof recasts BPF as a hypothesis class indexed by fingerprints and applies an Occam-style bound.

Lean 4's tactic stack is incomplete: some true entailments time out. With oracle failure probability η(τ), detection degrades to 1−δ−kη(τ); empirically η(30s)≈0.07 on DriftBench probes. Convention drift (notational differences invisible to provability) is addressed by a structural type-signature test.

FGD

Faithfulness-Guided Decoding

𝒮 can be plugged back into the autoformalizer as a reward signal during sample-and-rerank decoding. Candidates in the REJECT region trigger a re-prompt with the highest-scoring witness probe as an in-context repair hint.

NL Statement N
Autoformalizer
sample m candidates
BPF 𝒮 Scoring
k=8 probes / class
Rerank → F*

If 𝒮 < 0.78, witness probe is fed back to autoformalizer as an in-context repair hint (dashed loop in architecture diagram)

19.4%
Baseline drift rate
State-of-the-art autoformalizer, no FGD
10.3%
FGD, m=4 candidates
47% relative reduction; re-prompt drives most gains
8.7%
FGD, m=8 candidates
Further improvement with larger candidate pool
Ablations

Component contributions on the controlled split

CPG and bidirectional matching are the two largest individual contributors. APBA preserves performance even at reduced budget (k=8).

Full BPF + CPG
0.91
k=8 + APBA
0.89
− boundary probes
0.88
− adversarial probes
0.87
− counterfactual probes (CPG)
0.85
− hypothesis-drop probes
0.82
− bidirectional (forward only)
0.80
k=8 uniform (no APBA)
0.83
Citation

Cite this work

@inproceedings{mohammad2026the, title = {The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements}, author = {Noor Islam S. Mohammad and Tamim Sheikh}, booktitle = {The 6th Muslims in ML (MusIML) Workshop at ICML 2026}, year = {2026}, url = {https://openreview.net/forum?id=bDKvdZJVgK} }