Under the Hood of a Reasoning Model

Backtracking: Feature that identifies when the model recognizes mistakes in its reasoning and explicitly corrects itself.

Self reference: Feature that captures when the model refers back to its earlier statements or analyses in the response. https://www.goodfire.ai/research/under-the-hood-of-a-reasoning-model#

Rakaar's Notes