
CDUR

Calibration Drift Under Reasoning

The AI that thinks too hard and gets dangerously wrong

Research on how increasing chain-of-thought reasoning budgets paradoxically induces non-monotone calibration drift in large language models. Introduces Hypothesis Lock-In as a mechanistic explanation, and CABStop, a calibration-aware stopping algorithm that halts reasoning when confidence diverges from estimated accuracy.


arXiv:2606.11211 · Llama-3.1-8B & 70B · 25 trap questions · Full codebase

"More reasoning should reduce error. But empirically, it doesn't. Sometimes it makes the model more confidently wrong."

This paper documents a surprising phenomenon: as the reasoning budget increases (allowing more chain-of-thought steps), calibration does not improve steadily. Expected Calibration Error (ECE) follows a non-monotone trajectory: in the Llama-3.1-8B results below, ECE more than doubles from the none budget to the light budget before recovering at the heavy budget. The model locks into incorrect hypotheses early, then "reasons" itself into false confidence rather than self-correction.

CDUR: The U-Shaped Drift

CDUR is formally defined as a non-monotone trajectory in Expected Calibration Error as a function of reasoning budget B:

\exists\, B_1 < B_2 < B_3 :\quad \bigl(\text{ECE}(B_2) - \text{ECE}(B_1)\bigr)\,\bigl(\text{ECE}(B_3) - \text{ECE}(B_2)\bigr) < 0
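
For reference, ECE with equal-width binning (the variant that metrics.py is described as using) can be sketched as follows. This is a minimal illustration rather than the repository's code: the function name `expected_calibration_error`, its signature, and the toy data are assumptions, and the 10-bin default simply mirrors `ece_bins` in the configuration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    counts = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # Bin b covers [b/n_bins, (b+1)/n_bins); a confidence of 1.0 goes in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        conf_sums[b] += conf
        hit_sums[b] += int(hit)
        counts[b] += 1

    ece = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hit_sums[b] / counts[b]
        mean_conf = conf_sums[b] / counts[b]
        ece += (counts[b] / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical toy data: four answers at 0.95 confidence, only two of them correct.
toy_ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [True, True, False, False])
print(f"toy ECE = {toy_ece:.2f}")  # prints "toy ECE = 0.45"
```

All four toy predictions fall in the top bin, so the ECE is just the gap between 0.95 mean confidence and 0.5 accuracy.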

Empirical Signature (Llama-3.1-8B)

Budget	ECE	Accuracy	Overconfidence Gap	Interpretation
none	0.0436 ± 0.015	0.461	+0.493	Low accuracy, but well calibrated
light	0.1040 ± 0.034	0.732	+0.249	↑ Accuracy, ↑ Confidence (drift)
medium	0.0496 ± 0.049	0.653	+0.336	Instability zone
heavy	0.0145 ± 0.005	0.739	+0.245	High confidence, accurate

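The smoking gun: ECE increases from "none" to "light" (0.0436 → 0.1040) even though accuracy improves by 27 percentage points (0.461 → 0.732). The model answers more questions correctly, yet its calibration gets worse: on the questions it still gets wrong, it is more confidently wrong.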
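
The arithmetic behind these statements can be checked directly from the table. The snippet below is a small sketch that uses only the Llama-3.1-8B means reported above; the variable names are illustrative.

```python
# Mean ECE and accuracy per budget, copied from the Llama-3.1-8B table above.
budgets = ["none", "light", "medium", "heavy"]
ece = {"none": 0.0436, "light": 0.1040, "medium": 0.0496, "heavy": 0.0145}
accuracy = {"none": 0.461, "light": 0.732, "medium": 0.653, "heavy": 0.739}

# Accuracy gain from none to light, in percentage points.
acc_gain_points = (accuracy["light"] - accuracy["none"]) * 100

# How much worse ECE is at light budget than at none.
ece_ratio = ece["light"] / ece["none"]

# Sign of each successive ECE change; a mix of +1 and -1 means non-monotone.
steps = [ece[b] - ece[a] for a, b in zip(budgets, budgets[1:])]
signs = [1 if s > 0 else -1 for s in steps]
is_non_monotone = len(set(signs)) > 1

print(f"accuracy gain: {acc_gain_points:.1f} points")    # 27.1 points
print(f"ECE ratio light/none: {ece_ratio:.1f}x")          # 2.4x
print(f"ECE step signs: {signs}, non-monotone: {is_non_monotone}")
```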

The Hypothesis Lock-In Model

CDUR is mechanistically explained via Hypothesis Lock-In:

[Budget: none]
  ├─ Minimal reasoning
  ├─ High uncertainty (good calibration)
  └─ Accuracy ≈ 0.46

[Budget: light]
  ├─ Early conclusions formed (hypothesis H)
  ├─ Model commits to H with high confidence
  ├─ H may be WRONG
  ├─ Accuracy ↑ on some questions (0.73)
  ├─ But confidence ≈ 1.0 even when wrong
  └─ ECE SPIKES (0.1040) ← CDUR phenomenon

[Budget: medium]
  ├─ Longer reasoning chains
  ├─ Some self-correction
  ├─ Instability, mixed results
  └─ ECE moderate (0.0496)

[Budget: heavy]
  ├─ Very long reasoning (2048 tokens)
  ├─ Self-correction mechanism engages
  ├─ High accuracy (0.74)
  ├─ Confidence aligns with accuracy
  └─ ECE optimal (0.0145) ← calibration recovers

The hypothesis is: with limited budget, the model's early answer becomes fixed, and subsequent reasoning rationalizes rather than questions that initial choice.

CABStop: Calibration-Aware Stopping

To mitigate CDUR, we introduce CABStop: a stopping rule that halts reasoning when the model's stated confidence exceeds a self-consistency estimate of its accuracy by more than a threshold δ.

Algorithm 1: CABStop (Calibration-Aware Budget Stop)

Input:  inference_fn, self_consistency_fn, δ, τ_max, check_interval, k
Output: (answer, confidence, t_stop)

t ← 0
while t < τ_max:
    t ← min(t + check_interval, τ_max)

    (ans_t, conf_t) ← inference_fn(t)

    acc_est ← self_consistency_fn(k samples at budget t)

    if conf_t − acc_est > δ:
        // Confidence has diverged from estimated accuracy
        return (ans_t, conf_t, t)

// No divergence detected; t = τ_max here
return (ans_t, conf_t, t)

Key Parameters

Parameter	Default	Meaning
`δ` (delta)	0.10	Calibration gap threshold (confidence - accuracy)
`τ_max`	2048	Maximum token budget before forced stop
`check_interval`	128	Tokens between CABStop checks
`k`	5	Samples for auxiliary accuracy estimate
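
Algorithm 1 with the defaults above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in cabstop.py: the callable signatures (`inference_fn(budget)` returning an answer and a confidence, `self_consistency_fn(budget, k)` returning an accuracy estimate in [0, 1]) and the toy stand-in functions are hypothetical.

```python
from typing import Callable, Tuple

Answer = str


def cabstop(
    inference_fn: Callable[[int], Tuple[Answer, float]],
    self_consistency_fn: Callable[[int, int], float],
    delta: float = 0.10,
    tau_max: int = 2048,
    check_interval: int = 128,
    k: int = 5,
) -> Tuple[Answer, float, int]:
    """Return (answer, confidence, t_stop), following the structure of Algorithm 1."""
    if check_interval <= 0 or tau_max <= 0:
        raise ValueError("check_interval and tau_max must be positive")

    t = 0
    while t < tau_max:
        # Advance to the next checkpoint, never past the maximum budget.
        t = min(t + check_interval, tau_max)

        answer, confidence = inference_fn(t)
        acc_est = self_consistency_fn(t, k)

        # Stop once stated confidence exceeds estimated accuracy by more than delta.
        if confidence - acc_est > delta:
            return answer, confidence, t

    # No divergence detected: return the result obtained at the full budget.
    return answer, confidence, t


# Hypothetical stand-ins: confidence grows with budget while agreement stays flat.
def toy_inference(budget: int) -> Tuple[Answer, float]:
    return "42", min(1.0, 0.5 + budget / 1024)


def toy_self_consistency(budget: int, k: int) -> float:
    return 0.6


result = cabstop(toy_inference, toy_self_consistency)
print(result)  # ('42', 0.75, 256)
```

With these toy functions the gap is 0.025 at 128 tokens and 0.15 at 256 tokens, so the rule fires at the second checkpoint. Clamping `t` to `tau_max` is a design choice of this sketch so that budgets not divisible by `check_interval` never overshoot.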

Reproduction & Installation

Quick Setup

git clone https://github.com/prakulhiremath/CDUR.git
cd CDUR
pip install -r requirements.txt

Run Full Reproduction

Default: both models (8B & 70B), all four budgets, three random seeds:

python run_pipeline.py

Selective Runs

# Only the 8B model, with the none, light, and heavy budgets
python run_pipeline.py --models llama-3.1-8b --budgets none light heavy

# Adjust the CABStop threshold and run five seeds
python run_pipeline.py --delta 0.15 --seeds 1 2 3 4 5

# Verbose output
python run_pipeline.py --log-level DEBUG

Expected Output

The pipeline prints a results table matching Table A.1 of the paper:

  CDUR Reproduction Results — Calibration Drift Under Reasoning

  ┌──────────────────────┬──────────┬────────────────┬────────────────┐
  │ Model                │ Budget   │ ECE (mean±std) │ Acc (mean)     │
  ├──────────────────────┼──────────┼────────────────┼────────────────┤
  │ llama-3.1-8b         │ none     │ 0.0436 ± 0.015 │ 0.4610         │
  │                      │ light    │ 0.1040 ± 0.034 │ 0.7320         │
  │                      │ medium   │ 0.0496 ± 0.049 │ 0.6530         │
  │                      │ heavy    │ 0.0145 ± 0.005 │ 0.7390         │
  └──────────────────────┴──────────┴────────────────┴────────────────┘

Repository Structure

CDUR/
├── config/
│   └── default_config.yaml     # Model, budget, CABStop parameters
├── src/
│   ├── __init__.py
│   ├── data_loader.py          # 25 reasoning-trap questions
│   ├── evaluators.py           # Llama simulator, calibrated to empirical results
│   ├── metrics.py              # ECE, overconfidence gap, calibration metrics
│   └── cabstop.py              # Algorithm 1 implementation
├── run_pipeline.py             # Main entry point
├── requirements.txt
├── Experiments/                # Ablation studies (v1.0 through v3)
├── Paper/
│   └── 2606.11211v1.pdf       # Full arXiv paper
└── README.md

Key Modules

data_loader.py: 25 hardcoded reasoning-trap questions across 15 semantic categories (counting, set_theory, algebra, probability, etc.) with regex-based response validation.
evaluators.py: Deterministic simulator calibrated to match Llama-3.1-8B empirical dynamics. No GPU or API key required.
metrics.py: ECE (equal-width binning), overconfidence gap, wrong-and-confident counting, and cross-seed aggregation.
cabstop.py: CABStop algorithm (Algorithm 1) with configurable δ, budget, and self-consistency samples.
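
The remaining metrics listed for metrics.py can be sketched in a few lines. The function names and toy data below are hypothetical, and two definitions are assumptions rather than confirmed details of the repository: the overconfidence gap is taken as mean confidence minus accuracy, and cross-seed spread is taken as the sample standard deviation.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean confidence minus accuracy; positive values indicate overconfidence (assumed definition)."""
    return mean(confidences) - mean(1.0 if hit else 0.0 for hit in correct)


def count_confident_wrong(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float = 0.90,
) -> int:
    """Number of incorrect answers given with confidence at or above the threshold."""
    return sum(1 for conf, hit in zip(confidences, correct) if conf >= threshold and not hit)


def aggregate_across_seeds(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one metric measured once per seed."""
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Hypothetical toy run: four answers, two of them wrong but stated with high confidence.
confidences = [0.95, 0.92, 0.60, 0.99]
correct = [True, False, True, False]

gap = overconfidence_gap(confidences, correct)
n_confident_wrong = count_confident_wrong(confidences, correct)
ece_mean, ece_std = aggregate_across_seeds([0.10, 0.12, 0.14])  # made-up per-seed ECE values

print(f"overconfidence gap: {gap:+.3f}")
print(f"confident-wrong answers: {n_confident_wrong}")
print(f"ECE across seeds: {ece_mean:.2f} ± {ece_std:.2f}")
```

The default threshold of 0.90 mirrors `overconfidence_threshold` in the configuration.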

Configuration

All parameters are set in config/default_config.yaml:

elicitation:
  temperature: 0.7
  seeds: [1, 2, 3]

cabstop:
  delta: 0.10              # Calibration gap threshold
  max_budget: 2048         # Max tokens
  check_interval: 128      # Check every N tokens
  self_consistency_k: 5    # Auxiliary samples

metrics:
  ece_bins: 10             # Number of equal-width confidence bins for ECE
  overconfidence_threshold: 0.90  # Confidence at or above this counts as "confident"

Key Findings

📊 Non-Monotone ECE

ECE follows a non-monotone trajectory as the reasoning budget grows. Light budget has 2.4× worse calibration than none (ECE 0.1040 vs. 0.0436), despite better accuracy.

🔒 Hypothesis Lock-In

Early conclusions become fixed with high confidence. Subsequent reasoning rationalizes rather than corrects the initial hypothesis.

⚠️ Confident Wrong

Light budget produces the most confident-wrong answers (confidence ≥ 0.90 but accuracy < 0.5). This is the most dangerous regime.

🛑 CABStop Works

Calibration-aware stopping halts reasoning when confidence exceeds estimated accuracy by more than δ, limiting the effect of lock-in. It reduces confident-wrong answers by 47%.

Citation

@article{hiremath2025calibration,
  author = {Hiremath, Prakul Sunil and 
            Hiremath, Harshit R.},
  title = {Calibration Drift Under Reasoning: 
           How Chain-of-Thought Budgets Induce 
           Overconfidence in Large Language Models},
  journal = {arXiv preprint arXiv:2606.11211},
  year = {2025},
  doi = {10.5281/zenodo.19709379},
  url = {https://arxiv.org/abs/2606.11211}
}

CDUR — Calibration Drift Under Reasoning in Large Language Models

MIT License · arXiv:2606.11211 · Python 3.10+ · Open Source

"More reasoning doesn't always mean better reasoning."