Team: Chenxin An, Zhihui Xie†, Xiaonan Li†, Lei Li†, Jun Zhang, Shansan Gong, Ming Zhong*
Jingjing Xu, Xipeng Qiu, Mingxuan Wang, Lingpeng Kong*
*: Project Leads; †: Significant Contributors
Affiliations: The University of Hong Kong, Bytedance Seed, Fudan University
<aside> 📌
We are thrilled to unveil our latest breakthroughs, POLARIS-7B-Preview and POLARIS-4B-Preview, which mark a new frontier in open-recipe reasoning models developed with academic-level resources. POLARIS-4B-Preview is fine-tuned from Qwen3-4B, and POLARIS-7B-Preview is fine-tuned from DeepSeek-R1-Distill-Qwen-7B. By scaling reinforcement learning on open-source data, our 4B model achieves an impressive 81.2% Pass@1 accuracy on AIME24 and 79.4% Pass@1 accuracy on AIME25, outperforming state-of-the-art commercial models such as Claude-4-Opus, Grok-3-Beta, and o3-mini-high (2025/01/31). On AIME25, POLARIS astonishingly matches the performance of Qwen3-235B-A22B while using less than 2% of its parameters, and it can be deployed on consumer-grade GPUs.
To foster progress in scaling RL on advanced reasoning models, we are open-sourcing our dataset, code, and training details for the research community.
👨‍💻 GitHub, 🤗 HF Model, 🤗 HF Dataset, 📖 [paper](coming soon), 🔎 Evaluation results
</aside>
<aside> ✅
Takeaways for post-training of advanced reasoning models
POLARIS-4B-Preview was trained for 28 days on 32 H800 GPUs, using a batch size of 128, a rollout size of 8, and a dataset of 30K open-source examples. Training proceeds in three stages with the sequence length scaling up (40K, 48K, 52K) across 700 steps (see the sketch after this note). For complete training details, please refer to the released training scripts.
</aside>
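For concreteness, the staged schedule above can be captured in a small configuration sketch. The field names below are illustrative assumptions; the released training scripts define the actual configuration.

```python
# A minimal sketch of the staged RL schedule described in the takeaway above.
# Field names are illustrative assumptions; the released training scripts
# are the authoritative reference.
TRAIN_CONFIG = {
    "batch_size": 128,                 # prompts per RL update
    "rollout_n": 8,                    # sampled responses per prompt
    "train_dataset_size": 30_000,      # open-source examples
    "total_steps": 700,                # across all three stages
    # Maximum sequence length is scaled up stage by stage.
    "stage_max_lengths": [40_000, 48_000, 52_000],
}
```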
Recent work (e.g., DeepScaleR) demonstrates that a small model (e.g., 1.5B parameters) can achieve surprising improvements on reasoning tasks through scaled RL training. However, when we apply this recipe to more advanced reasoning models, we observe only marginal improvements, or even declines, during RL training of Qwen3. This suggests a critical gap in the open-source community's understanding of how to further scale RL for advanced reasoning models. To address this, we introduce POLARIS, a post-training recipe centered on calibrated data difficulty, enhanced data diversity, inference-time length scaling, and efficient training.
We are committed to transparency and will be open-sourcing our trained models, training code, and data to foster community progress.
Our POLARIS recipe builds upon a deep investigation into training data difficulty. Specifically, we conduct controlled experiments on data difficulty, measured by model pass rate, and choose publicly available training datasets to enable better reproducibility.
Our initial experiments involve training models of different scales on the public DeepScaleR dataset. While a 1.5B model shows significant performance gains as expected, a 7B model trained on the same data exhibits only marginal improvements. We observe that the 7B model's average reward quickly surpasses 0.7, indicating that the training set is too simple to drive further improvements.
This leads us to a core hypothesis: For effective RL training, the difficulty of the data must be carefully calibrated to the model's scale and capability.
To validate this, we analyze the difficulty distribution of the 40K samples in the DeepScaleR training set. We use DeepSeek-R1-Distill-Qwen-7B and its 1.5B variant to perform an offline evaluation, generating 8 solutions for each problem at a sampling temperature of 0.6. The percentage of correct solutions serves as a proxy for the difficulty of each sample.
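A minimal sketch of this offline difficulty estimation is given below, assuming a vLLM-style generation API and a simplified boxed-answer comparison in place of a full math verifier (both are assumptions, not our exact evaluation code).

```python
import re
from vllm import LLM, SamplingParams  # assumes vLLM is available for offline generation

def extract_boxed(text: str) -> str | None:
    """Pull the last \\boxed{...} answer from a generated solution (simplified check)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def estimate_pass_rates(problems: list[dict], model_path: str) -> list[float]:
    """Return, for each problem, the fraction of 8 sampled solutions that are correct.

    Each problem dict is assumed to carry a "prompt" and a gold "answer";
    the string comparison below is a placeholder for a proper math verifier.
    """
    llm = LLM(model=model_path)
    params = SamplingParams(n=8, temperature=0.6, max_tokens=32768)
    outputs = llm.generate([p["prompt"] for p in problems], params)
    rates = []
    for out, prob in zip(outputs, problems):
        correct = sum(extract_boxed(o.text) == prob["answer"] for o in out.outputs)
        rates.append(correct / 8)  # pass rate as a difficulty proxy
    return rates

# e.g. rates_7b = estimate_pass_rates(deepscaler_data, "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
```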
The results, shown in the figure below, are revealing: across both model scales, most problems are either very easy (8/8 correct solutions) or very hard (0/8 correct solutions).
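The distribution itself can be tallied directly from the per-problem pass counts; the snippet below is a sketch that reuses the pass rates computed above.

```python
from collections import Counter

def difficulty_histogram(pass_rates: list[float], n_samples: int = 8) -> Counter:
    """Count how many problems fall into each pass bucket (0..8 correct out of 8)."""
    return Counter(round(r * n_samples) for r in pass_rates)

# A bimodal result, e.g. large counts at buckets 0 and 8, indicates that most
# problems are either trivially easy or currently unsolvable for the model.
```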