DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI).

Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources compared to pre-training. In the context of reasoning capabilities, OpenAI's o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge of effective test-time scaling remains an open question for the research community. Several prior works have explored various approaches, including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI's o1 series models.

In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits superior performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.

However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start examples to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL, as in DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios.
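To make the GRPO reference above concrete, the following is a minimal sketch of the two ingredients at the core of the method (Shao et al., 2024): a group-relative advantage computed from scalar rewards of several sampled outputs per prompt, and a PPO-style clipped surrogate loss. Function names, tensor shapes, and hyperparameters here are illustrative assumptions, and the KL regularization toward a reference policy used in the full method is omitted for brevity.

```python
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages: for each prompt, normalize the scalar reward
    of every sampled output by the mean and std of rewards within that group.

    rewards: (num_prompts, group_size) -- one scalar reward per sampled output.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


def grpo_clipped_loss(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate on sequence-level log-probabilities
    (log-probs summed over tokens); minimizing this loss maximizes the objective.

    logp_new, logp_old, advantages: (num_prompts, group_size).
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


# Toy usage: 2 prompts, a group of 4 sampled answers each, binary rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
adv = grpo_advantages(rewards)
logp_old = torch.randn(2, 4)
logp_new = logp_old + 0.01 * torch.randn(2, 4)
loss = grpo_clipped_loss(logp_new, logp_old, adv)
```

Because the baseline is the mean reward of the group itself, no separate value (critic) model is needed, which is the main practical difference from standard PPO.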
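The pass@1 and majority-voting figures quoted above for AIME 2024 are both computed from multiple sampled answers per problem. The helpers below sketch the two metrics under the assumption that a final answer string has already been extracted and normalized from each generated chain of thought; they are illustrative, not the paper's evaluation code.

```python
from collections import Counter
from typing import Sequence


def pass_at_1(samples: Sequence[Sequence[str]], refs: Sequence[str]) -> float:
    """Estimate pass@1 as average per-sample correctness: for each question,
    the fraction of its k sampled answers that match the reference,
    averaged over all questions."""
    per_question = [
        sum(ans == ref for ans in answers) / len(answers)
        for answers, ref in zip(samples, refs)
    ]
    return sum(per_question) / len(per_question)


def majority_vote(samples: Sequence[Sequence[str]], refs: Sequence[str]) -> float:
    """Consensus accuracy (majority voting): take the most frequent final answer
    among the k samples for each question and score that single answer."""
    hits = 0
    for answers, ref in zip(samples, refs):
        consensus, _ = Counter(answers).most_common(1)[0]
        hits += int(consensus == ref)
    return hits / len(refs)


# Toy usage: 2 questions, 4 sampled final answers each.
samples = [["42", "42", "41", "42"], ["7", "8", "8", "8"]]
refs = ["42", "8"]
print(pass_at_1(samples, refs))     # 0.75
print(majority_vote(samples, refs)) # 1.0
```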