🧠 LLM Calibration 📊 Reasoning Drift 📄 arXiv ✍️ Medium 🔬 Zenodo 🔄 Reproducible

CDUR

Calibration Drift Under Reasoning

The AI that thinks too hard and gets dangerously wrong

This project studies how increasing chain-of-thought reasoning budgets can induce non-monotone calibration drift in large language models. It introduces Hypothesis Lock-In, a mechanistic explanation of the effect, and CABStop, a calibration-aware stopping algorithm that halts reasoning when confidence exceeds an estimate of accuracy by more than a threshold.

arXiv:2606.11211 · Llama-3.1-8B & 70B · 25 trap questions · Full codebase

"More reasoning should reduce error. But empirically, it doesn't. Sometimes it makes the model more confidently wrong."

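This paper documents a surprising phenomenon: as the reasoning budget increases (allowing more chain-of-thought steps), calibration does not improve steadily. Expected Calibration Error (ECE) instead follows a non-monotone curve, which the paper describes as U-shaped: in the Llama-3.1-8B results below, ECE more than doubles between the "none" and "light" budgets and only reaches its minimum at the "heavy" budget. The model locks onto an incorrect hypothesis early, then "reasons" itself into false confidence instead of correcting itself.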

CDUR: The U-Shaped Drift

CDUR is formally defined as a non-monotone trajectory in Expected Calibration Error as a function of reasoning budget B:

$$\exists\, B_1 < B_2 < B_3 : \big(\text{ECE}(B_2) - \text{ECE}(B_1)\big)\,\big(\text{ECE}(B_3) - \text{ECE}(B_2)\big) < 0$$
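For reference, ECE is usually computed as a binned estimate: predictions are grouped into confidence bins, and the gap between accuracy and mean confidence in each bin is averaged, weighted by bin size. The following is a minimal pure-Python sketch under that assumption, using 10 equal-width bins to match `ece_bins` in the config. The function name `expected_calibration_error` and the toy data are illustrative and are not taken from `src/metrics.py`.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin_size / n) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Collect (confidence, correctness) pairs per equal-width bin on [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data, made up for illustration (not from the paper).
confs = [0.95, 0.95, 0.55, 0.55]
hits = [True, False, True, False]
print(round(expected_calibration_error(confs, hits), 3))
```

In the toy data the two high-confidence answers are right only half the time, so that bin dominates the score. This is the pattern the lock-in account predicts at light budgets.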

Empirical Signature (Llama-3.1-8B)

| Budget | ECE | Accuracy | Overconfidence Gap | Interpretation |
|---|---|---|---|---|
| none | 0.0436 ± 0.015 | 0.461 | +0.493 | Uncertain but accurate |
| light | 0.1040 ± 0.034 | 0.732 | +0.249 | ↑ Accuracy, ↑ Confidence (drift) |
| medium | 0.0496 ± 0.049 | 0.653 | +0.336 | Instability zone |
| heavy | 0.0145 ± 0.005 | 0.739 | +0.245 | High confidence, accurate |

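The smoking gun: ECE rises from 0.0436 ("none") to 0.1040 ("light") even though accuracy improves by 27 percentage points (0.461 to 0.732). The model gets more answers right, but the answers it still gets wrong are held with near-total confidence, so calibration worsens.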
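The arithmetic behind these statements can be checked directly from the table. The sketch below hard-codes the Llama-3.1-8B means reported above (the variable names are illustrative) and computes the accuracy change in percentage points, the ECE ratio between the light and none budgets, and whether the ECE sequence is monotone in budget.

```python
# Mean ECE and accuracy per budget for Llama-3.1-8B, copied from the table above.
budgets = ["none", "light", "medium", "heavy"]
ece = {"none": 0.0436, "light": 0.1040, "medium": 0.0496, "heavy": 0.0145}
acc = {"none": 0.461, "light": 0.732, "medium": 0.653, "heavy": 0.739}

# Accuracy change from none to light, in percentage points.
acc_gain_pp = (acc["light"] - acc["none"]) * 100

# How many times larger ECE is at light than at none.
ece_ratio = ece["light"] / ece["none"]

# A sequence is monotone if it never increases or never decreases.
values = [ece[b] for b in budgets]
steps = [later - earlier for earlier, later in zip(values, values[1:])]
is_monotone = all(s >= 0 for s in steps) or all(s <= 0 for s in steps)

print(f"accuracy gain: {acc_gain_pp:.1f} points")   # 27.1
print(f"ECE ratio light/none: {ece_ratio:.2f}")     # 2.39
print(f"ECE monotone in budget: {is_monotone}")     # False
```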

The Hypothesis Lock-In Model

CDUR is mechanistically explained via Hypothesis Lock-In:

[Budget: none]
  ├─ Minimal reasoning
  ├─ High uncertainty (good calibration)
  └─ Accuracy ≈ 0.46

[Budget: light]
  ├─ Early conclusions formed (hypothesis H)
  ├─ Model commits to H with high confidence
  ├─ H may be WRONG
  ├─ Accuracy ↑ on some questions (0.73)
  ├─ But confidence ≈ 1.0 even when wrong
  └─ ECE SPIKES (0.1040) ← **CDUR phenomenon**

[Budget: medium]
  ├─ Longer reasoning chains
  ├─ Some self-correction
  ├─ Instability, mixed results
  └─ ECE moderate (0.0496)

[Budget: heavy]
  ├─ Very long reasoning (2048 tokens)
  ├─ Self-correction mechanism engages
  ├─ High accuracy (0.74)
  ├─ Confidence aligns with accuracy
  └─ ECE optimal (0.0145) ← calibration recovers

The hypothesis is: with limited budget, the model's early answer becomes fixed, and subsequent reasoning rationalizes rather than questions that initial choice.

CABStop: Calibration-Aware Stopping

To mitigate CDUR, we introduce CABStop: a stopping rule that halts reasoning when the model's stated confidence exceeds a self-consistency estimate of its accuracy by more than a threshold δ.

Algorithm 1: CABStop (Calibration-Aware Budget Stop)

Input:  inference_fn(t), self_consistency_fn, δ, τ_max
Output: (answer, confidence, t_stop)

t ← 0
while t < τ_max:
    t ← t + check_interval

    (ans_t, conf_t) ← inference_fn(t)

    acc_est ← self_consistency_fn(k samples at budget t)

    if conf_t − acc_est > δ:
        ↳ Confidence diverged from accuracy
        return (ans_t, conf_t, t)

return (ans_τ_max, conf_τ_max, τ_max)
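Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in `src/cabstop.py`: `inference_fn(t)` is assumed to return an `(answer, confidence)` pair, `sample_fn(t)` is a hypothetical callable that draws one sampled answer at budget `t`, and the accuracy estimate is assumed to be the share of `k` samples that agree with the current answer. The sketch also clamps the last check to `tau_max`, which the pseudocode leaves implicit. The toy stubs at the end are invented for the demonstration.

```python
from typing import Callable, Hashable, Tuple


def self_consistency(sample_fn: Callable[[int], Hashable], answer: Hashable,
                     budget: int, k: int = 5) -> float:
    """Assumed accuracy proxy: share of k sampled answers that match `answer`."""
    samples = [sample_fn(budget) for _ in range(k)]
    return sum(s == answer for s in samples) / k


def cabstop(inference_fn: Callable[[int], Tuple[Hashable, float]],
            sample_fn: Callable[[int], Hashable],
            delta: float = 0.10, tau_max: int = 2048,
            check_interval: int = 128, k: int = 5):
    """Return (answer, confidence, t_stop), following the structure of Algorithm 1."""
    t = 0
    answer, conf = None, 0.0
    while t < tau_max:
        # Advance the budget, never past tau_max (assumption: clamp the last step).
        t = min(t + check_interval, tau_max)
        answer, conf = inference_fn(t)
        acc_est = self_consistency(sample_fn, answer, t, k)
        if conf - acc_est > delta:
            # Confidence exceeds the accuracy estimate by more than delta: stop here.
            return answer, conf, t
    # The stop rule never fired; the last check was made at tau_max.
    return answer, conf, tau_max


# Toy stubs (hypothetical): confidence grows with budget while samples keep disagreeing.
def toy_inference(t):
    return "A", min(1.0, 0.5 + t / 1024)


_answers = iter(["A", "B"] * 100)


def toy_sample(t):
    return next(_answers)


print(cabstop(toy_inference, toy_sample))  # ('A', 0.75, 256)
```

In the toy run, the first check (t = 128) sees confidence 0.625 against an agreement rate of 0.6, which is within δ. The second check (t = 256) sees confidence 0.75 against 0.4, so the rule fires.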

Key Parameters

| Parameter | Default | Meaning |
|---|---|---|
| δ (delta) | 0.10 | Calibration gap threshold (confidence - accuracy) |
| τ_max | 2048 | Maximum token budget before forced stop |
| check_interval | 128 | Tokens between CABStop checks |
| k | 5 | Samples for auxiliary accuracy estimate |
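These defaults bound the extra sampling cost of CABStop. Assuming each check draws k fresh auxiliary samples, as Algorithm 1 suggests, a question that never triggers the stop rule incurs τ_max / check_interval checks. The helper below is a hypothetical illustration of that arithmetic, not part of the repository.

```python
import math


def cabstop_overhead(tau_max=2048, check_interval=128, k=5):
    """Worst-case (checks, auxiliary samples) for one question."""
    # One check per interval, rounding up if tau_max is not a multiple of it.
    checks = math.ceil(tau_max / check_interval)
    return checks, checks * k


checks, samples = cabstop_overhead()
print(checks, samples)  # 16 80
```

With the defaults this is at most 16 checks and 80 auxiliary samples per question. Raising `check_interval` lowers the cost but also makes the stop point coarser.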

Reproduction & Installation

Quick Setup

git clone https://github.com/prakulhiremath/CDUR.git
cd CDUR
pip install -r requirements.txt

Run Full Reproduction

Default: both models (8B & 70B), all four budgets, three random seeds:

python run_pipeline.py

Selective Runs

# Only the 8B model, with the none, light, and heavy budgets
python run_pipeline.py --models llama-3.1-8b --budgets none light heavy

# Adjust the CABStop threshold and run five seeds
python run_pipeline.py --delta 0.15 --seeds 1 2 3 4 5

# Verbose output
python run_pipeline.py --log-level DEBUG

Expected Output

Prints a results table matching Table A.1 of the paper:

  CDUR Reproduction Results - Calibration Drift Under Reasoning

  ┌──────────────────────┬──────────┬────────────────┬────────────────┐
  │ Model                │ Budget   │ ECE (mean±std) │ Acc (mean)     │
  ├──────────────────────┼──────────┼────────────────┼────────────────┤
  │ llama-3.1-8b         │ none     │ 0.0436 ± 0.015 │ 0.4610         │
  │                      │ light    │ 0.1040 ± 0.034 │ 0.7320         │
  │                      │ medium   │ 0.0496 ± 0.049 │ 0.6530         │
  │                      │ heavy    │ 0.0145 ± 0.005 │ 0.7390         │
  └──────────────────────┴──────────┴────────────────┴────────────────┘

Repository Structure

cdur/
├── config/
│   └── default_config.yaml     # Model, budget, CABStop parameters
├── src/
│   ├── __init__.py
│   ├── data_loader.py          # 25 reasoning-trap questions
│   ├── evaluators.py           # Llama simulator, calibrated to empirical results
│   ├── metrics.py              # ECE, overconfidence gap, calibration metrics
│   └── cabstop.py              # Algorithm 1 implementation
├── run_pipeline.py             # Main entry point
├── requirements.txt
├── Experiments/                # Ablation studies (v1.0 through v3)
├── Paper/
│   └── 2606.11211v1.pdf       # Full arXiv paper
└── README.md

Key Modules

Configuration

All parameters in config/default_config.yaml:

elicitation:
  temperature: 0.7
  seeds: [1, 2, 3]

cabstop:
  delta: 0.10              # Calibration gap threshold
  max_budget: 2048         # Max tokens
  check_interval: 128      # Check every N tokens
  self_consistency_k: 5    # Auxiliary samples

metrics:
  ece_bins: 10             # ECE binning
  overconfidence_threshold: 0.90  # Confidence level

Key Findings

📊 Non-Monotone ECE

ECE follows a non-monotone trajectory (U-shaped, in the paper's terms) as the reasoning budget grows. The light budget has 2.4× worse calibration than none (ECE 0.1040 vs. 0.0436), despite better accuracy.

🔒 Hypothesis Lock-In

Early conclusions become fixed with high confidence. Subsequent reasoning rationalizes rather than corrects the initial hypothesis.

⚠️ Confident Wrong

The light budget produces the most confident-wrong answers (confidence ≥ 0.90 but accuracy < 0.5). This is the most dangerous regime.
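One way to count confident-wrong cases is per question, aggregating over repeated runs. The sketch below assumes that reading: a question is flagged when its mean confidence is at least 0.90 (matching `overconfidence_threshold` in the config) and its accuracy across runs is below 0.5. The function name, the input layout, and the example records are hypothetical and may differ from the metric in `src/metrics.py`.

```python
def confident_wrong_rate(records, conf_threshold=0.90, acc_threshold=0.5):
    """Share of questions with mean confidence >= conf_threshold and accuracy < acc_threshold.

    `records` maps a question id to a list of (confidence, is_correct) pairs,
    one pair per run of that question.
    """
    if not records:
        raise ValueError("records is empty")
    flagged = 0
    for runs in records.values():
        mean_conf = sum(conf for conf, _ in runs) / len(runs)
        accuracy = sum(1 for _, ok in runs if ok) / len(runs)
        if mean_conf >= conf_threshold and accuracy < acc_threshold:
            flagged += 1
    return flagged / len(records)


# Made-up example with three questions and three runs each.
toy = {
    "q1": [(0.97, False), (0.95, False), (0.96, True)],  # confident and mostly wrong
    "q2": [(0.92, True), (0.94, True), (0.91, True)],    # confident and right
    "q3": [(0.55, False), (0.60, False), (0.50, True)],  # mostly wrong, not confident
}
print(confident_wrong_rate(toy))
```

Only `q1` meets both conditions, so one of the three toy questions is flagged.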

🛑 CABStop Works

Calibration-aware stopping recovers performance, halting reasoning before lock-in occurs. It reduces confident-wrong answers by 47%.

Citation

@article{hiremath2025calibration,
  author = {Hiremath, Prakul Sunil and 
            Hiremath, Harshit R.},
  title = {Calibration Drift Under Reasoning: 
           How Chain-of-Thought Budgets Induce 
           Overconfidence in Large Language Models},
  journal = {arXiv preprint arXiv:2606.11211},
  year = {2025},
  doi = {10.5281/zenodo.19709379},
  url = {https://arxiv.org/abs/2606.11211}
}

CDUR: Calibration Drift Under Reasoning in Large Language Models

MIT License · arXiv:2606.11211 · Python 3.10+ · Open Source

"More reasoning doesn't always mean better reasoning."