Team: Chenxin An, Zhihui Xie†, Xiaonan Li†, Lei Li†, Jun Zhang, Shansan Gong, Ming Zhong*
Jingjing Xu, Xipeng Qiu, Mingxuan Wang, Lingpeng Kong*
*: Project Leads; †: Significant Contributors
Affiliations: The University of Hong Kong, Bytedance Seed, Fudan University
<aside> 📌
We are thrilled to unveil our latest breakthroughs, POLARIS-7B-Preview and POLARIS-4B-Preview, which mark a new frontier in open-recipe reasoning models developed with academic-level resources. POLARIS-4B-Preview is fine-tuned from Qwen3-4B, and POLARIS-7B-Preview is fine-tuned from DeepSeek-R1-Distill-Qwen-7B. By scaling reinforcement learning on open-source data, our 4B model achieves an impressive 81.2% Pass@1 accuracy on AIME24 and 79.4% Pass@1 accuracy on AIME25, outperforming state-of-the-art commercial models such as Claude-4-Opus, Grok-3-Beta, and o3-mini-high (2025/01/31). On AIME25, POLARIS astonishingly matches the performance of Qwen3-235B-A22B while using less than 2% of its parameters, and it can be deployed on consumer-grade GPUs.
To foster progress in scaling RL on advanced reasoning models, we are open-sourcing our dataset, code, and training details for the research community.
👨‍💻 GitHub, 🤗 HF Model, 🤗 HF Dataset, 📖 [paper](coming soon), 🔎 Evaluation results
</aside>
<aside> ✅
Takeaways for post-training of advanced reasoning models
POLARIS-4B-Preview was trained for 28 days on 32 H800 GPUs, using a batch size of 128, a rollout size of 8, and a dataset of 30K open-source examples. Training proceeds in three stages with the sequence length scaling up (40K, 48K, 52K) across 700 steps (see the sketch after this note). For complete training details, please refer to the released training scripts.
</aside>
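For concreteness, the staged schedule above can be captured in a small configuration sketch. The field names below are illustrative assumptions; the released training scripts define the actual configuration.

```python
# A minimal sketch of the staged RL schedule described in the takeaway above.
# Field names are illustrative assumptions; the released training scripts
# are the authoritative reference.
TRAIN_CONFIG = {
    "batch_size": 128,                 # prompts per RL update
    "rollout_n": 8,                    # sampled responses per prompt
    "train_dataset_size": 30_000,      # open-source examples
    "total_steps": 700,                # across all three stages
    # Maximum sequence length is scaled up stage by stage.
    "stage_max_lengths": [40_000, 48_000, 52_000],
}
```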
Recent work (e.g., DeepScaleR) demonstrates that a small model (e.g., 1.5B parameters) can achieve surprising improvements on reasoning tasks through scaled RL training. However, when we apply this recipe to more advanced reasoning models, we observe only marginal improvements, or even declines, during RL training of Qwen3. This suggests a critical gap in the open-source community's understanding of how to further scale RL for advanced reasoning models. To address this, we introduce POLARIS, a post-training recipe centered on calibrated data difficulty, enhanced data diversity, inference-time length scaling, and efficient training.
We are committed to transparency and will be open-sourcing our trained models, training code, and data to foster community progress.
Our POLARIS recipe builds upon a deep investigation into training data difficulty. Specifically, we conduct controlled experiments on data difficulty, measured by model pass rate, and choose publicly available training datasets to enable better reproducibility.
Our initial experiments involve training models of different scales on the public DeepScaleR dataset. While a 1.5B model shows significant performance gains as expected, a 7B model trained on the same data exhibits only marginal improvements. We observe that the 7B model's average reward quickly surpasses 0.7, indicating that the training set is too simple to drive further improvements.
This leads us to a core hypothesis: For effective RL training, the difficulty of the data must be carefully calibrated to the model's scale and capability.
To validate this, we analyze the difficulty distribution of the 40K samples in the DeepScaleR training set. We use DeepSeek-R1-Distill-Qwen-7B and its 1.5B variant to perform an offline evaluation, generating 8 solutions for each problem at a sampling temperature of 0.6. The percentage of correct solutions serves as a proxy for the difficulty of each sample.
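A minimal sketch of this offline difficulty estimation is given below, assuming a vLLM-style generation API and a simplified boxed-answer comparison in place of a full math verifier (both are assumptions, not our exact evaluation code).

```python
import re
from vllm import LLM, SamplingParams  # assumes vLLM is available for offline generation

def extract_boxed(text: str) -> str | None:
    """Pull the last \\boxed{...} answer from a generated solution (simplified check)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def estimate_pass_rates(problems: list[dict], model_path: str) -> list[float]:
    """Return, for each problem, the fraction of 8 sampled solutions that are correct.

    Each problem dict is assumed to carry a "prompt" and a gold "answer";
    the string comparison below is a placeholder for a proper math verifier.
    """
    llm = LLM(model=model_path)
    params = SamplingParams(n=8, temperature=0.6, max_tokens=32768)
    outputs = llm.generate([p["prompt"] for p in problems], params)
    rates = []
    for out, prob in zip(outputs, problems):
        correct = sum(extract_boxed(o.text) == prob["answer"] for o in out.outputs)
        rates.append(correct / 8)  # pass rate as a difficulty proxy
    return rates

# e.g. rates_7b = estimate_pass_rates(deepscaler_data, "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
```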
The results, shown in the figure below, are revealing: across both model scales, most problems are either very easy (8/8 correct solutions) or very hard (0/8 correct solutions).
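The distribution itself can be tallied directly from the per-problem pass counts; the snippet below is a sketch that reuses the pass rates computed above.

```python
from collections import Counter

def difficulty_histogram(pass_rates: list[float], n_samples: int = 8) -> Counter:
    """Count how many problems fall into each pass bucket (0..8 correct out of 8)."""
    return Counter(round(r * n_samples) for r in pass_rates)

# A bimodal result, e.g. large counts at buckets 0 and 8, indicates that most
# problems are either trivially easy or currently unsolvable for the model.
```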