Calibration Drift Under Reasoning
The AI that thinks too hard and gets dangerously wrong
Research on how increasing chain-of-thought reasoning budgets paradoxically induces non-monotone calibration drift in large language models. Introduces the Hypothesis Lock-In mechanistic explanation and CABStop: a calibration-aware stopping algorithm that halts reasoning when confidence diverges from actual accuracy.
arXiv:2606.11211 · Llama-3.1-8B & 70B · 25 trap questions · Full codebase
"More reasoning should reduce error. But empirically, it doesn't. Sometimes it makes the model more confidently wrong."
This paper documents a surprising phenomenon: as the reasoning budget grows (more chain-of-thought steps allowed), calibration does not improve monotonically. In the results below, Expected Calibration Error (ECE) spikes at the light budget and recovers only at the heavy budget, a non-monotone trajectory the paper describes as U-shaped. The model locks into incorrect hypotheses early, then "reasons" itself into false confidence instead of correcting itself.
CDUR is formally defined as a non-monotone trajectory in Expected Calibration Error as a function of reasoning budget B:
| Budget | ECE | Accuracy | Overconfidence Gap | Interpretation |
|---|---|---|---|---|
| none | 0.0436 ± 0.015 | 0.461 | +0.493 | Uncertain but accurate |
| light | 0.1040 ± 0.034 | 0.732 | +0.249 | ↑ Accuracy, ↑ Confidence (drift) |
| medium | 0.0496 ± 0.049 | 0.653 | +0.336 | Instability zone |
| heavy | 0.0145 ± 0.005 | 0.739 | +0.245 | High confidence, accurate |
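The smoking gun: ECE increases from "none" to "light" (0.0436 to 0.1040) even though accuracy improves by 27 percentage points (0.461 to 0.732). The model answers more questions correctly, yet its stated confidence tracks its actual accuracy less well.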
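The two calibration columns in the table can be illustrated with a minimal sketch. It assumes the standard equal-width binned ECE (10 bins, matching the `ece_bins` setting in the config) and assumes the overconfidence gap is mean confidence minus accuracy; the repository's `metrics.py` may differ in detail. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width binned ECE: sum over bins of weight * |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    # Assign each prediction to one of n_bins equal-width confidence bins.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Assumed definition: mean confidence minus accuracy (positive = overconfident)."""
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return mean_conf - accuracy


# Toy data, made up for illustration: four answers at 0.95 confidence, half correct.
toy_conf = [0.95, 0.95, 0.95, 0.95]
toy_hits = [True, True, False, False]
print(expected_calibration_error(toy_conf, toy_hits))
print(overconfidence_gap(toy_conf, toy_hits))
```

In the toy case all answers fall in one bin, so both metrics reduce to the same quantity: stated confidence (0.95) minus observed accuracy (0.5).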
CDUR is mechanistically explained via Hypothesis Lock-In:
[Budget: none]
├─ Minimal reasoning
├─ High uncertainty (good calibration)
└─ Accuracy ≈ 0.46

[Budget: light]
├─ Early conclusions formed (hypothesis H)
├─ Model commits to H with high confidence
├─ H may be WRONG
├─ Accuracy ↑ on some questions (0.73)
├─ But confidence ≈ 1.0 even when wrong
└─ ECE SPIKES (0.1040) ← **CDUR phenomenon**

[Budget: medium]
├─ Longer reasoning chains
├─ Some self-correction
├─ Instability, mixed results
└─ ECE moderate (0.0496)

[Budget: heavy]
├─ Very long reasoning (2048 tokens)
├─ Self-correction mechanism engages
├─ High accuracy (0.74)
├─ Confidence aligns with accuracy
└─ ECE optimal (0.0145) ← calibration recovers
The hypothesis is: with limited budget, the model's early answer becomes fixed, and subsequent reasoning rationalizes rather than questions that initial choice.
To mitigate CDUR, we introduce CABStop: a principled stopping rule that halts reasoning when confidence diverges from actual accuracy.
Algorithm 1: CABStop (Calibration-Aware Budget Stop)
Input: inference_fn(t), self_consistency_fn, δ, τ_max
Output: (answer, confidence, t_stop)
t ← 0
while t < τ_max:
    t ← t + check_interval
    (ans_t, conf_t) ← inference_fn(t)
    acc_est ← self_consistency_fn(k samples at budget t)
    if conf_t - acc_est > δ:
        ⊳ Confidence diverged from accuracy
        return (ans_t, conf_t, t)
return (ans_τ_max, conf_τ_max, τ_max)
| Parameter | Default | Meaning |
|---|---|---|
| δ (delta) | 0.10 | Calibration gap threshold (confidence - accuracy) |
| τ_max | 2048 | Maximum token budget before forced stop |
| check_interval | 128 | Tokens between CABStop checks |
| k | 5 | Samples for auxiliary accuracy estimate |
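Algorithm 1 and its defaults can be sketched in Python as follows. This is an illustrative reading, not the repository's `cabstop.py`: the callable signatures (`inference_fn(t)` returning an answer and a confidence, `self_consistency_fn(t, k)` returning an accuracy estimate) are assumptions, the clamp of `t` to `tau_max` is an added safeguard, and the two toy functions are hypothetical stand-ins for a real model.

```python
from typing import Callable, Tuple


def cabstop(
    inference_fn: Callable[[int], Tuple[str, float]],
    self_consistency_fn: Callable[[int, int], float],
    delta: float = 0.10,
    tau_max: int = 2048,
    check_interval: int = 128,
    k: int = 5,
) -> Tuple[str, float, int]:
    """Return (answer, confidence, t_stop), stopping once confidence - acc_est > delta."""
    if tau_max <= 0 or check_interval <= 0:
        raise ValueError("tau_max and check_interval must be positive")
    t = 0
    while t < tau_max:
        # Assumption: clamp so a non-divisible interval never exceeds the cap.
        t = min(t + check_interval, tau_max)
        answer, confidence = inference_fn(t)
        acc_est = self_consistency_fn(t, k)  # auxiliary accuracy estimate from k samples
        if confidence - acc_est > delta:
            # Confidence diverged from estimated accuracy: stop early.
            return answer, confidence, t
    # Budget exhausted without divergence; t == tau_max here.
    return answer, confidence, t


def toy_inference(t: int) -> Tuple[str, float]:
    # Hypothetical model whose confidence climbs with the token budget.
    return "42", min(1.0, 0.5 + t / 1024)


def toy_self_consistency(t: int, k: int) -> float:
    # Hypothetical agreement rate among k samples, held flat for the demo.
    return 0.6


print(cabstop(toy_inference, toy_self_consistency))  # ('42', 0.75, 256)
```

With the toy functions, the gap is 0.025 at t = 128 and 0.15 at t = 256, so the rule halts at the second check. A larger `delta` lets reasoning run longer before the stop fires.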
git clone https://github.com/prakulhiremath/CDUR.git
cd CDUR
pip install -r requirements.txt
Default: both models (8B & 70B), all four budgets, three random seeds:
python run_pipeline.py
# Only the 8B model with the none, light, and heavy budgets
python run_pipeline.py --models llama-3.1-8b --budgets none light heavy
# Adjust CABStop threshold
python run_pipeline.py --delta 0.15 --seeds 1 2 3 4 5
# Verbose output
python run_pipeline.py --log-level DEBUG
Prints a results table matching Table A.1 from the paper:
CDUR Reproduction Results - Calibration Drift Under Reasoning
┌──────────────────────┬──────────┬────────────────┬────────────────┐
│ Model                │ Budget   │ ECE (mean±std) │ Acc (mean)     │
├──────────────────────┼──────────┼────────────────┼────────────────┤
│ llama-3.1-8b         │ none     │ 0.0436 ± 0.015 │ 0.4610         │
│                      │ light    │ 0.1040 ± 0.034 │ 0.7320         │
│                      │ medium   │ 0.0496 ± 0.049 │ 0.6530         │
│                      │ heavy    │ 0.0145 ± 0.005 │ 0.7390         │
└──────────────────────┴──────────┴────────────────┴────────────────┘

cdur/
├── config/
│   └── default_config.yaml    # Model, budget, CABStop parameters
├── src/
│   ├── __init__.py
│   ├── data_loader.py         # 25 reasoning-trap questions
│   ├── evaluators.py          # Llama simulator, calibrated to empirical results
│   ├── metrics.py             # ECE, overconfidence gap, calibration metrics
│   └── cabstop.py             # Algorithm 1 implementation
├── run_pipeline.py            # Main entry point
├── requirements.txt
├── Experiments/               # Ablation studies (v1.0 through v3)
├── Paper/
│   └── 2606.11211v1.pdf       # Full arXiv paper
└── README.md
All parameters in config/default_config.yaml:
elicitation:
  temperature: 0.7
  seeds: [1, 2, 3]
cabstop:
  delta: 0.10                     # Calibration gap threshold
  max_budget: 2048                # Max tokens
  check_interval: 128             # Check every N tokens
  self_consistency_k: 5           # Auxiliary samples
metrics:
  ece_bins: 10                    # ECE binning
  overconfidence_threshold: 0.90  # Confidence level

ECE follows a non-monotone (U-shaped) trajectory as the reasoning budget grows. The light budget has 2.4× worse calibration (higher ECE) than none, despite better accuracy.
Early conclusions become fixed with high confidence. Subsequent reasoning rationalizes rather than corrects the initial hypothesis.
The light budget produces the most confident-wrong answers (confidence ≥ 0.90 but accuracy < 0.5). This is the most dangerous regime.
Calibration-aware stopping (CABStop) recovers performance by halting reasoning before lock-in occurs. It reduces confident-wrong answers by 47%.
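The arithmetic behind the first finding can be checked directly from the reported llama-3.1-8b numbers. The sketch below copies the ECE and accuracy values from the results table; the helper names are hypothetical. It also includes one assumed per-answer reading of "confident-wrong" (confidence at or above the 0.90 threshold and incorrect), which may not match how the repository computes it. The sample inputs to that helper are made up for illustration.

```python
from typing import Sequence

# ECE and accuracy per budget, copied from the results table (llama-3.1-8b).
REPORTED = {
    "none": {"ece": 0.0436, "acc": 0.461},
    "light": {"ece": 0.1040, "acc": 0.732},
    "medium": {"ece": 0.0496, "acc": 0.653},
    "heavy": {"ece": 0.0145, "acc": 0.739},
}
BUDGET_ORDER = ["none", "light", "medium", "heavy"]


def is_non_monotone(values: Sequence[float]) -> bool:
    """True if the sequence both rises and falls somewhere along its length."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return any(d > 0 for d in diffs) and any(d < 0 for d in diffs)


def confident_wrong_rate(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float = 0.90,
) -> float:
    """Assumed definition: share of all answers that are wrong with confidence >= threshold."""
    flagged = sum(1 for c, ok in zip(confidences, correct) if c >= threshold and not ok)
    return flagged / len(confidences)


ece_by_budget = [REPORTED[b]["ece"] for b in BUDGET_ORDER]
ece_ratio = REPORTED["light"]["ece"] / REPORTED["none"]["ece"]
acc_gain = REPORTED["light"]["acc"] - REPORTED["none"]["acc"]

print(is_non_monotone(ece_by_budget))  # True
print(round(ece_ratio, 1))             # 2.4
print(round(acc_gain, 3))              # 0.271

# Made-up answers: two of four are wrong at high confidence.
print(confident_wrong_rate([0.95, 0.95, 0.5, 0.99], [True, False, False, False]))  # 0.5
```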
@article{hiremath2025calibration,
author = {Hiremath, Prakul Sunil and
Hiremath, Harshit R.},
title = {Calibration Drift Under Reasoning:
How Chain-of-Thought Budgets Induce
Overconfidence in Large Language Models},
journal = {arXiv preprint arXiv:2606.11211},
year = {2025},
doi = {10.5281/zenodo.19709379},
url = {https://arxiv.org/abs/2606.11211}
}
CDUR: Calibration Drift Under Reasoning in Large Language Models
MIT License · arXiv:2606.11211 · Python 3.10+ · Open Source
"More reasoning doesn't always mean better reasoning."