The Faithfulness Gap — MusIML @ ICML 2026

Abstract

Autoformalization is bottlenecked by faithfulness, not fluency.

A formal statement can typecheck, be provable, and still encode a different theorem than the one the mathematician wrote. We introduce Bidirectional Provability Fingerprinting (BPF), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement.

We introduce four complementary innovations — CPG (counterfactual probe generation), the Equivalence Spectrum (continuous faithfulness score), APBA (adaptive budget allocation), and FGD (faithfulness-guided decoding) — and prove a drift detection theorem and a PAC-faithfulness result. We release DriftBench, the first benchmark of 2,183 NL/Lean 4 pairs with controlled drift labels across six subfields of mathlib4.

The Problem

A formalization can be correct-looking and wrong.

All three Lean 4 statements below typecheck and can be proved by a downstream prover. Only (A) faithfully captures the Extreme Value Theorem.

✓ (A) Faithful

-- Correct: requires Nonempty
theorem evt_A
    {K : Set ℝ}
    (hK : IsCompact K)
    (hne : K.Nonempty)
    {f : ℝ → ℝ}
    (hf : ContinuousOn f K) :
    ∃ x ∈ K, ∀ y ∈ K,
      f y ≤ f x

✗ (B) Hypothesis Omission

-- ΔH: missing Nonempty
-- false on empty compact set
theorem evt_B
    {K : Set ℝ}
    (hK : IsCompact K)
    -- hne missing!
    {f : ℝ → ℝ}
    (hf : ContinuousOn f K) :
    ∃ x ∈ K, ∀ y ∈ K,
      f y ≤ f x

⚠ (C) Quantifier Inversion

-- Δ∀: swapped ∃x∀y → ∀y∃x
-- trivially true (take x = y)
theorem evt_C
    {K : Set ℝ}
    (hK : IsCompact K)
    (hne : K.Nonempty)
    {f : ℝ → ℝ}
    (hf : ContinuousOn f K) :
    ∀ y ∈ K, ∃ x ∈ K,
      f y ≤ f x

Why existing checks fail: Typechecking accepts all three. A downstream prover closes all three — (C) is trivially true (take x = y), giving no signal that it is unrelated to the Extreme Value Theorem. BLEU against a reference catches neither if the reference is phrased slightly differently. Current pipelines accept all three as valid outputs. A pilot study found 19% of outputs from a state-of-the-art autoformalizer on graduate-level analysis exhibit at least one form of semantic drift.

Drift Taxonomy

Four canonical ways a formalization fails

These four classes account for 94% of human-labeled drift instances in a pilot study of 400 outputs from a state-of-the-art autoformalizer.

Δ∀ — Quantifier Inversion

Swapped quantifier order

Replaces ∀x∃y with ∃y∀x or vice versa. The drifted formalization is often provable — even trivially so — yet semantically disjoint from N. Most pernicious in practice because the statement remains provable.

Witnessability rate: α ≈ 0.62 | k ≈ 7 probes for δ = 0.01

ΔH — Hypothesis Omission

Missing premise

A stated premise of N is absent from F. The formalization is strictly stronger than intended: it claims a result without a needed assumption. Detected by CPG's hypothesis-drop probes — removing the probe budget for this class costs 9 F1 points.

Witnessability rate: α ≈ 0.55 | k ≈ 8 probes for δ = 0.01

ΔC — Conclusion Generalisation

Weakened conclusion

F ⇒ N but N ¬⇒ F. The conclusion in the formalization is strictly weaker than intended. Locally plausible — mathematically sound — but globally unsound as a representation of the intended claim.

Witnessability rate: α ≈ 0.48 | k ≈ 10 probes for δ = 0.01

ΔT — Type Coercion

Silent type retagging

Silent retagging, e.g., ℝ → ℚ or total → partial or Group → Monoid. The statement may remain provable in the new type, but it encodes a different mathematical claim. Hardest to detect syntactically.

Witnessability rate: α ≈ 0.31 | Lowest; structural test used as complement

Method

Bidirectional Provability Fingerprinting (BPF)

A statement's meaning is determined by what it implies and what implies it. BPF operationalizes this as a probe-based, reference-free certifier: the forward fingerprint Φ⁺ captures downstream commitments; the backward fingerprint Φ⁻ captures upstream conditions. Forward-only fingerprints cannot distinguish a strict strengthening of N from N itself; bidirectional matching is the minimum sufficient signal (Proposition 4.2).

① Input

Natural-Language Statement N

Candidate Formalization F

② Probe Pipeline (CPG)

CPG — Counterfactual Probe Gen.

Probe Label Predictor

Probe Set 𝒫_N + labels

Autoformalize Probes

③ Entailment Oracle (BPF, APBA)

APBA Budget Router

Lean 4 + mathlib Entailment Oracle

Φ⁺_F Forward

Φ⁻_F Backward

④ Decision (Equivalence Spectrum)

𝒮(F,N) ∈ [0,1]

≥0.93
ACCEPT

0.78–0.93
REVIEW

<0.78
REJECT

CPG

Counterfactual Probe Generation

For each drift class Δ, constructs a mechanically perturbed twin N^Δ and generates probes that distinguish N from N^Δ. Reaches BPF-naive's k=32 detection rate at k≈12 — halving the oracle budget while improving coverage on three of four drift classes.

Equivalence Spectrum

Continuous Faithfulness Score

𝒮(F,N) ∈ [0,1] replaces brittle binary verdicts. Enables ranking of multiple candidates from an autoformalizer, triage of borderline cases to human inspection, and continuous reward shaping for FGD. Calibrated thresholds: ACCEPT (≥0.93) / REVIEW / REJECT (<0.78).

APBA

Adaptive Probe Budget Allocation

Information-theoretic budget router: at each step selects the next drift class to probe by maximising expected reduction in posterior entropy. Achieves a 3.2× budget reduction over uniform sampling while preserving detection performance — confirmed by Theorem 6.1.

FGD

Faithfulness-Guided Decoding

Uses 𝒮 as a reward during sample-and-rerank decoding. Candidates with 𝒮 < 0.78 trigger a re-prompt with the highest-scoring witness probe as an in-context repair hint, biasing the autoformalizer away from the detected drift class.

Results

BPF + CPG dominates across all operating points

Controlled DriftBench split. BPF + CPG reduces the residual error of the best non-fingerprinting baseline (LLM-judge) by approximately 72% at matched FPR. Adding APBA preserves detection while cutting wall-clock cost 3.2×.

Method	F1	Det @ 3% FPR	FPR	Cost
Typecheck only	0.27	11.4%	2.0%	1×
Provability	0.42	41.2%	3.0%	18×
BLEU-ref	0.55	30.1%	3.0%	1×
Back-translation	0.66	58.9%	3.0%	4×
LLM-judge	0.71	63.3%	3.0%	2×
BPF-naive	0.85	83.2%	3.1%	22×
BPF + CPG	0.91	89.6%	3.0%	26×
BPF + CPG + APBA	0.91	89.4%	3.1%	8×

Per-class detection rate (at 3% FPR)

Δ∀ Quantifier Inv.

Typecheck

7%

Provability

46%

LLM-judge

74%

BPF+CPG

94%

ΔH Hyp. Omission

Typecheck

5%

Provability

62%

LLM-judge

67%

BPF+CPG

93%

ΔC Concl. Gen.

Typecheck

4%

Provability

33%

LLM-judge

64%

BPF+CPG

91%

ΔT Type Coercion

Typecheck

31%

Provability

18%

LLM-judge

42%

BPF+CPG

66%

ΔT (type coercion) remains the hardest class, consistent with its lower witnessability rate α(ΔT) ≈ 0.31. On the wild split (384 pairs), BPF+CPG achieves κ = 0.77 vs. inter-annotator κ = 0.81.

DriftBench

The first benchmark with controlled drift labels

2,183

NL / Lean 4 Pairs

6

mathlib4 Subfields

5

Drift Label Classes

κ 0.81

Inter-Annotator (Wild)

Pairs are constructed using deterministic perturbation rules (not LLM sampling) so that drift labels are unambiguous. A complementary wild split contains 384 pairs where the candidate was generated by an LLM autoformalizer and independently annotated by two experts (κ = 0.81).

Analysis

455 pairs Faith:198 Δ∀:64 ΔH:71 ΔC:58 ΔT:33 Comb.:31

Algebra

396 pairs Faith:173 Δ∀:51 ΔH:68 ΔC:49 ΔT:28 Comb.:27

Topology

372 pairs Faith:162 Δ∀:49 ΔH:59 ΔC:47 ΔT:30 Comb.:25

Number Theory

326 pairs Faith:144 Δ∀:42 ΔH:55 ΔC:41 ΔT:22 Comb.:22

Combinatorics

316 pairs Faith:138 Δ∀:43 ΔH:51 ΔC:39 ΔT:24 Comb.:21

Category Theory

267 pairs Faith:117 Δ∀:35 ΔH:44 ΔC:33 ΔT:19 Comb.:19

Faithful Δ∀ ΔH ΔC ΔT Combined

Theory

Formal guarantees for drift detection

Theorem 7.1 — Drift Detection

High-Probability Detection

For any drift class D and candidate F exhibiting drift of class D, if the probe generator samples with witnessability rate α(D) > 0, then with budget k ≥ log(1/δ) / α(D), BPF flags F as drifted with probability at least 1−δ (assuming a complete entailment oracle). Empirical rates: α(Δ∀)≈0.62, α(ΔH)≈0.55, α(ΔC)≈0.48, α(ΔT)≈0.31.

Theorem 7.2 — PAC-Faithfulness

Polynomial Learnability of [N]_T

With k ≥ (1/α₀) log(1/δϵ) probes, BPF produces a hypothesis that correctly classifies F ∈ [N]_T with probability ≥ 1−ϵ over the autoformalizer's output distribution, with confidence 1−δ. Budget scales as O(log(1/δ) / ϵ). The proof recasts BPF as a hypothesis class indexed by fingerprints and applies an Occam-style bound.

Lean 4's tactic stack is incomplete: some true entailments time out. With oracle failure probability η(τ), detection degrades to 1−δ−kη(τ); empirically η(30s)≈0.07 on DriftBench probes. Convention drift (notational differences invisible to provability) is addressed by a structural type-signature test.

FGD

Faithfulness-Guided Decoding

𝒮 can be plugged back into the autoformalizer as a reward signal during sample-and-rerank decoding. Candidates in the REJECT region trigger a re-prompt with the highest-scoring witness probe as an in-context repair hint.

NL Statement N

→

Autoformalizer
sample m candidates

→

BPF 𝒮 Scoring
k=8 probes / class

→

Rerank → F*

If 𝒮 < 0.78, witness probe is fed back to autoformalizer as an in-context repair hint (dashed loop in architecture diagram)

19.4%

Baseline drift rate

State-of-the-art autoformalizer, no FGD

→

10.3%

FGD, m=4 candidates

47% relative reduction; re-prompt drives most gains

8.7%

FGD, m=8 candidates

Further improvement with larger candidate pool

Ablations

Component contributions on the controlled split

CPG and bidirectional matching are the two largest individual contributors. APBA preserves performance even at reduced budget (k=8).

Full BPF + CPG

0.91

k=8 + APBA

0.89

− boundary probes

0.88

− adversarial probes

0.87

− counterfactual probes (CPG)

0.85

− hypothesis-drop probes

0.82

− bidirectional (forward only)

0.80

k=8 uniform (no APBA)

0.83

Citation

Cite this work

@inproceedings{mohammad2026the, title = {The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements}, author = {Noor Islam S. Mohammad and Tamim Sheikh}, booktitle = {The 6th Muslims in ML (MusIML) Workshop at ICML 2026}, year = {2026}, url = {https://openreview.net/forum?id=bDKvdZJVgK} }