Hi!
I think that models trained with distillation followed by reinforcement learning, or multiple distillation steps, should have a separate section (or at least an “SFT + RL” indication).
For instance, starting from Qwen2.5-Math-7B, the Sky-T1-7B model is trained with a 4-step SFT->RL->SFT->RL pipeline, whereas Oat-Zero and others are trained in a single RL run.
Wdyt?
(aside: I think the leaderboard would benefit greatly from a verified / unverified split like SWE-bench, so that new releases can be added and compared quickly. It would need an easy way to run the full pipeline locally, but I think this would be very useful to the community.)