Beyond the Two-Stage Pipeline: Unifying SFT and RL

1. Why Integrate SFT and RL? While reinforcement learning (RL) can effectively enhance a model’s reasoning capabilities, a crucial prerequisite is that the base model already possesses some degree of the relevant abilities. In RL training, further improvement is possible only if the model can sample correct trajectories across multiple rollouts, which limits the exploration space of RL. Therefore, the mainstream approach is to first equip the model with foundational abilities via supervised fine-tuning (SFT) and then use RL to further enhance them. However, some studies argue that this two-stage approach is not optimal: ...

July 17, 2025 · 9 min · Qiangwei Bai