Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements
1 Dept. of Computer Science, Istanbul Technical University, Türkiye
2 Dept. of CS & Engineering, Jashore University of Science and Technology, Bangladesh
* Corresponding author
A formal statement can typecheck, be provable, and still encode a different theorem than the one the mathematician wrote. We introduce Bidirectional Provability Fingerprinting (BPF), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement.
We introduce four complementary innovations — CPG (counterfactual probe generation), the Equivalence Spectrum (continuous faithfulness score), APBA (adaptive budget allocation), and FGD (faithfulness-guided decoding) — and prove a drift detection theorem and a PAC-faithfulness result. We release DriftBench, the first benchmark of 2,183 NL/Lean 4 pairs with controlled drift labels across six subfields of mathlib4.
All three Lean 4 statements below typecheck and can be proved by a downstream prover. Only (A) faithfully captures the Extreme Value Theorem.
-- Correct: requires Nonempty theorem evt_A {K : Set ℝ} (hK : IsCompact K) (hne : K.Nonempty) {f : ℝ → ℝ} (hf : ContinuousOn f K) : ∃ x ∈ K, ∀ y ∈ K, f y ≤ f x
-- ΔH: missing Nonempty -- false on empty compact set theorem evt_B {K : Set ℝ} (hK : IsCompact K) -- hne missing! {f : ℝ → ℝ} (hf : ContinuousOn f K) : ∃ x ∈ K, ∀ y ∈ K, f y ≤ f x
-- Δ∀: swapped ∃x∀y → ∀y∃x -- trivially true (take x = y) theorem evt_C {K : Set ℝ} (hK : IsCompact K) (hne : K.Nonempty) {f : ℝ → ℝ} (hf : ContinuousOn f K) : ∀ y ∈ K, ∃ x ∈ K, f y ≤ f x
These four classes account for 94% of human-labeled drift instances in a pilot study of 400 outputs from a state-of-the-art autoformalizer.
Replaces ∀x∃y with ∃y∀x or vice versa. The drifted formalization is often provable — even trivially so — yet semantically disjoint from N. Most pernicious in practice because the statement remains provable.
A stated premise of N is absent from F. The formalization is strictly stronger than intended: it claims a result without a needed assumption. Detected by CPG's hypothesis-drop probes — removing the probe budget for this class costs 9 F1 points.
F ⇒ N but N ¬⇒ F. The conclusion in the formalization is strictly weaker than intended. Locally plausible — mathematically sound — but globally unsound as a representation of the intended claim.
Silent retagging, e.g., ℝ → ℚ or total → partial or Group → Monoid. The statement may remain provable in the new type, but it encodes a different mathematical claim. Hardest to detect syntactically.
A statement's meaning is determined by what it implies and what implies it. BPF operationalizes this as a probe-based, reference-free certifier: the forward fingerprint Φ+ captures downstream commitments; the backward fingerprint Φ− captures upstream conditions. Forward-only fingerprints cannot distinguish a strict strengthening of N from N itself; bidirectional matching is the minimum sufficient signal (Proposition 4.2).
For each drift class Δ, constructs a mechanically perturbed twin NΔ and generates probes that distinguish N from NΔ. Reaches BPF-naive's k=32 detection rate at k≈12 — halving the oracle budget while improving coverage on three of four drift classes.
𝒮(F,N) ∈ [0,1] replaces brittle binary verdicts. Enables ranking of multiple candidates from an autoformalizer, triage of borderline cases to human inspection, and continuous reward shaping for FGD. Calibrated thresholds: ACCEPT (≥0.93) / REVIEW / REJECT (<0.78).
Information-theoretic budget router: at each step selects the next drift class to probe by maximising expected reduction in posterior entropy. Achieves a 3.2× budget reduction over uniform sampling while preserving detection performance — confirmed by Theorem 6.1.
Uses 𝒮 as a reward during sample-and-rerank decoding. Candidates with 𝒮 < 0.78 trigger a re-prompt with the highest-scoring witness probe as an in-context repair hint, biasing the autoformalizer away from the detected drift class.
Controlled DriftBench split. BPF + CPG reduces the residual error of the best non-fingerprinting baseline (LLM-judge) by approximately 72% at matched FPR. Adding APBA preserves detection while cutting wall-clock cost 3.2×.
| Method | F1 | F1 Visualization | Det @ 3% FPR | FPR | Cost |
|---|---|---|---|---|---|
| Typecheck only | 0.27 | 11.4% | 2.0% | 1× | |
| Provability | 0.42 | 41.2% | 3.0% | 18× | |
| BLEU-ref | 0.55 | 30.1% | 3.0% | 1× | |
| Back-translation | 0.66 | 58.9% | 3.0% | 4× | |
| LLM-judge | 0.71 | 63.3% | 3.0% | 2× | |
| BPF-naive | 0.85 | 83.2% | 3.1% | 22× | |
| BPF + CPG | 0.91 | 89.6% | 3.0% | 26× | |
| BPF + CPG + APBA | 0.91 | 89.4% | 3.1% | 8× |
ΔT (type coercion) remains the hardest class, consistent with its lower witnessability rate α(ΔT) ≈ 0.31. On the wild split (384 pairs), BPF+CPG achieves κ = 0.77 vs. inter-annotator κ = 0.81.
Pairs are constructed using deterministic perturbation rules (not LLM sampling) so that drift labels are unambiguous. A complementary wild split contains 384 pairs where the candidate was generated by an LLM autoformalizer and independently annotated by two experts (κ = 0.81).
For any drift class D and candidate F exhibiting drift of class D, if the probe generator samples with witnessability rate α(D) > 0, then with budget k ≥ log(1/δ) / α(D), BPF flags F as drifted with probability at least 1−δ (assuming a complete entailment oracle). Empirical rates: α(Δ∀)≈0.62, α(ΔH)≈0.55, α(ΔC)≈0.48, α(ΔT)≈0.31.
With k ≥ (1/α0) log(1/δϵ) probes, BPF produces a hypothesis that correctly classifies F ∈ [N]T with probability ≥ 1−ϵ over the autoformalizer's output distribution, with confidence 1−δ. Budget scales as O(log(1/δ) / ϵ). The proof recasts BPF as a hypothesis class indexed by fingerprints and applies an Occam-style bound.
Lean 4's tactic stack is incomplete: some true entailments time out. With oracle failure probability η(τ), detection degrades to 1−δ−kη(τ); empirically η(30s)≈0.07 on DriftBench probes. Convention drift (notational differences invisible to provability) is addressed by a structural type-signature test.
𝒮 can be plugged back into the autoformalizer as a reward signal during sample-and-rerank decoding. Candidates in the REJECT region trigger a re-prompt with the highest-scoring witness probe as an in-context repair hint.
If 𝒮 < 0.78, witness probe is fed back to autoformalizer as an in-context repair hint (dashed loop in architecture diagram)
CPG and bidirectional matching are the two largest individual contributors. APBA preserves performance even at reduced budget (k=8).