Welcome to Qiangwei Bai’s AI Blog

Exploring the frontiers of artificial intelligence, deep learning, and machine learning research

Beyond the Two-Stage Pipeline: Unifying SFT and RL

1. Why Integrate SFT and RL?

While Reinforcement Learning can effectively enhance a model’s reasoning capabilities, a crucial prerequisite is that the base model already possesses some degree of the relevant abilities. In RL training, further improvement is only possible if the model can sample correct trajectories through multiple rollouts. This undoubtedly limits the exploration space of RL.

Therefore, the mainstream approach is to first equip the model with foundational abilities via SFT, and then leverage RL to further enhance these capabilities. However, some studies argue that this two-stage approach is not optimal: ...

July 17, 2025 · 9 min · Qiangwei Bai

Entropy Collapse and Mitigation Strategies

1. Policy Entropy and Entropy Collapse

1.1 Entropy Definition

Let $x$ denote the prompt and $y$ denote the response. The policy $\pi_{\theta}$ outputs a probability distribution for token $t$ as follows:

$$ p_t=(p_{t,1},\dots,p_{t,|V|})=\pi_{\theta}(\cdot\mid x,y_{\lt t})=\text{softmax}\left(\frac{z_t}{T}\right) \quad (1) $$

Here, $|V|$ is the size of the vocabulary, $z_t\in\mathbb{R}^{|V|}$ are the logits, and $T\in\mathbb{R}$ is the decoding temperature.

The entropy for token $t$ is then given by:

$$ H_t=-\sum_{j=1}^{|V|} p_{t,j}\log p_{t,j} \quad (2) $$

1.2 Entropy Collapse and Model Performance

In the early stages of RL training, the model’s entropy drops sharply. As entropy decreases, accuracy enters a period of rapid growth. However, the rapid depletion of entropy can lead to the model becoming overconfident, which in turn diminishes its exploration capabilities. Through empirical studies, [1] established a quantitative relationship between policy entropy $H$ and downstream task performance $R$: ...
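As a quick illustration of equations (1) and (2), here is a minimal PyTorch sketch (not taken from the post itself) that computes per-token entropy from raw logits with a decoding temperature; the function name, tensor shapes, and example logits are assumptions for demonstration only.

```python
import torch

def token_entropy(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Per-token policy entropy H_t from raw logits.

    logits: tensor of shape (seq_len, vocab_size), i.e. z_t for each position t.
    Returns a tensor of shape (seq_len,) containing H_t for each token.
    """
    # Eq. (1): p_t = softmax(z_t / T) over the vocabulary dimension.
    log_p = torch.log_softmax(logits / temperature, dim=-1)
    p = log_p.exp()
    # Eq. (2): H_t = -sum_j p_{t,j} * log p_{t,j}
    return -(p * log_p).sum(dim=-1)

# A sharply peaked distribution has near-zero entropy (the "collapsed" regime),
# while a flat distribution has maximal entropy log|V|.
peaked = torch.tensor([[10.0, 0.0, 0.0, 0.0]])
flat = torch.tensor([[1.0, 1.0, 1.0, 1.0]])
print(token_entropy(peaked))  # close to 0 nats
print(token_entropy(flat))    # log(4) ≈ 1.386 nats
```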

July 5, 2025 · 9 min · Qiangwei Bai