A well-calibrated agent that says 70% is right ~70% of the time. We bucket every resolved prediction by its stated confidence (x-axis) and plot the actual hit rate (y-axis) for each bucket. Points on the diagonal = perfect calibration. Below it = over-confident. Above it = under-confident.
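The bucketing step can be sketched as follows. This is a minimal illustration, not the dashboard's actual implementation; the `(confidence, was_correct)` record shape and the helper name `calibration_buckets` are assumptions.

```python
from collections import defaultdict

def calibration_buckets(predictions, n_bins=10):
    """Group resolved predictions by stated confidence and return
    (bucket_midpoint, hit_rate, count) for each non-empty bucket."""
    buckets = defaultdict(list)
    for confidence, was_correct in predictions:
        # Map confidence in [0, 1] to a bin index; 1.0 joins the top bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        buckets[idx].append(was_correct)
    return [
        ((idx + 0.5) / n_bins, sum(hits) / len(hits), len(hits))
        for idx, hits in sorted(buckets.items())
    ]

# Hypothetical resolved predictions: (stated confidence, outcome).
preds = [(0.72, True), (0.74, True), (0.71, False), (0.93, True), (0.95, True)]
for mid, rate, n in calibration_buckets(preds):
    print(f"bucket {mid:.2f}: hit rate {rate:.2f} over {n} predictions")
```

Plotting each bucket's midpoint against its hit rate, with marker size proportional to the count, yields the chart described above.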
dashed line = perfect calibration · bubble size = sample count
How to read this: residual = actual hit rate − stated confidence. Positive residual = under-confident (the agent should state a higher %). Negative = over-confident (the agent states a higher % than its hit rate supports). Brier score < 0.20 is publishable-quality calibration; the random-guessing baseline (always saying 50%) is 0.25.
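The two summary numbers above can be sketched directly. This assumes the same hypothetical `(confidence, outcome)` records as before; the function names are illustrative, not the dashboard's API.

```python
def brier_score(predictions):
    """Mean squared error between stated confidence and the 0/1 outcome.
    0.0 is perfect; always saying 50% scores exactly 0.25."""
    return sum((conf - float(hit)) ** 2 for conf, hit in predictions) / len(predictions)

def residual(confidence, hit_rate):
    """Positive -> under-confident, negative -> over-confident."""
    return hit_rate - confidence

# Hypothetical resolved predictions: (stated confidence, outcome).
preds = [(0.7, True), (0.7, True), (0.7, False), (0.9, True)]
print(f"Brier: {brier_score(preds):.3f}")             # lower is better
print(f"residual at 0.7: {residual(0.7, 2/3):+.3f}")  # slightly over-confident
```

Note that the 0.25 baseline falls out of the formula: a constant 50% prediction is always off by 0.5, and 0.5 squared is 0.25 regardless of outcomes.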