Entropy Collapse and Mitigation Strategies
1. Policy Entropy and Entropy Collapse

1.1 Entropy Definition

Let $x$ denote the prompt and $y$ denote the response. At each token position $t$, the policy $\pi_{\theta}$ outputs a probability distribution over the vocabulary:

$$
p_t=(p_{t,1},\dots,p_{t,|V|})=\pi_{\theta}(\cdot\mid x,y_{<t})=\text{softmax}\!\left(\frac{z_t}{T}\right) \quad (1)
$$

Here, $|V|$ is the vocabulary size, $z_t\in\mathbb{R}^{|V|}$ are the logits, and $T\in\mathbb{R}$ is the decoding temperature. The entropy at token $t$ is then given by:

$$
H_t=-\sum_{j=1}^{|V|} p_{t,j}\log p_{t,j} \quad (2)
$$

A minimal code sketch of this computation is given at the end of this section.

1.2 Entropy Collapse and Model Performance

In the early stages of RL training, the model’s entropy drops sharply. As entropy decreases, accuracy enters a period of rapid growth. However, the rapid depletion of entropy leaves the model overconfident, which in turn diminishes its capacity for exploration. Through empirical studies, [1] established a quantitative relationship between policy entropy $H$ and downstream task performance $R$: ...
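To make Eqs. (1)–(2) concrete, the following is a minimal sketch (assuming PyTorch; the function name and tensor shapes are illustrative, not from the source) that computes per-token entropy $H_t$ from raw logits $z_t$ at a given decoding temperature $T$:

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Per-token entropy H_t from logits z_t, following Eqs. (1)-(2).

    logits: tensor of shape (..., |V|), e.g. (batch, seq_len, vocab_size).
    Returns a tensor of shape (...,) with the entropy at each token position.
    """
    # Eq. (1): p_t = softmax(z_t / T); log_softmax is used for numerical stability.
    log_p = F.log_softmax(logits / temperature, dim=-1)
    p = log_p.exp()
    # Eq. (2): H_t = -sum_j p_{t,j} * log p_{t,j}
    return -(p * log_p).sum(dim=-1)

# Example usage: entropy at each position of a batch of sampled sequences.
logits = torch.randn(2, 5, 32000)           # (batch, seq_len, vocab_size)
H = token_entropy(logits, temperature=1.0)  # shape (2, 5)
print(H)
```

Averaging these per-token values over the sampled responses gives the policy-level entropy whose collapse is tracked during RL training.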