
深入了解DeepSeek的AI模型内部机制,并不能解答所有疑问。

qimuai 一手编译



内容来源:https://www.sciencenews.org/article/ai-model-deepseek-answers-training

内容总结:

深度求索AI模型“推理”能力引关注,低成本训练路径与内部机制仍待解

近一年前,中国AI公司深度求索(DeepSeek)因其大语言模型在数学与编程基准测试中媲美OpenAI模型而引发业界震动。该公司宣称,其模型在保持低成本的同时实现了高性能,暗示AI进步未必依赖昂贵算力与顶级芯片。此后,一系列研究聚焦于其模型的“推理”方法,试图理解、改进乃至超越它。

深度求索模型引人注目之处不仅在于免费使用,更在于其采用的强化学习训练方式。与依赖海量人工标注数据不同,其R1-Zero和R1模型主要通过“试错”机制训练:模型尝试解决问题,答案正确则获得奖励,类似人类解谜过程。这种路径被认为大幅降低了训练成本——强化学习无需监督每一步骤,只需反馈最终结果好坏,减少了对标注数据与算力的依赖。

为验证成果,深度求索罕见地将模型提交给外部科学家评审,相关论文已于今年9月17日发表于《自然》期刊。评审专家指出,此举相当于“亮出底牌”,有助于学界验证与改进算法。训练中,模型对每个问题尝试多种解法(如15种猜测),任一正确即可获得奖励。但若全部错误,则无法获得学习信号。因此,强化学习生效的前提是模型需具备较好的初始猜测能力,而深度求索的基础模型V3 Base已在此类问题上表现出较高准确率。

然而,模型内部工作机制仍不清晰。尽管深度求索模型在输出时会生成类人的“思考过程”,并随训练进展出现“顿悟时刻”等表述,但专家指出,这未必反映真实的推理步骤。由于奖励基于最终答案正确与否,即使中间步骤存在无关或错误路径,只要结果正确,整个生成过程仍会受奖励。这种基于结果的奖励机制,难以确保模型真正学会了高效的推理逻辑。

此外,基准测试的局限性也引发思考。在固定题目集上表现出色,可能源于模型在训练中记忆了答案,而非掌握了通用解题能力。专家强调,“在基准测试上表现优异”与“使用人类式推理过程解决问题”存在本质区别。过度解读AI的“推理”能力,可能导致用户不加批判地接受其结论,带来潜在风险。

目前,深度求索模型如何完成多步骤任务,其训练方法究竟在模型中注入了何种信息,仍是一个开放问题。尽管低成本训练路径展现了吸引力,但揭开AI模型“黑箱”之谜,仍是科研界持续探索的方向。

中文翻译:

深入探究DeepSeek的AI模型内部,并不能获得所有答案。
我们仍不清楚这些模型如何完成多步骤的数学与编程任务。
自DeepSeek在人工智能领域掀起波澜以来,已近一年时间。

今年1月,这家中国公司宣称其某个大型语言模型在评估多步骤问题解决能力(即AI领域所称的“推理能力”)的数学与编程基准测试中,达到了与OpenAI同类模型相当的水平。DeepSeek最引人注目的主张是:它在实现这一性能的同时保持了低成本。这意味着:AI模型的改进并非总是需要庞大的计算基础设施或顶尖的计算机芯片,或许通过高效利用成本更低的硬件也能实现。在这项引发广泛关注的声明之后,一系列研究随之展开,试图更好地理解DeepSeek模型的推理方法、改进它们甚至超越它们。

DeepSeek模型之所以引人入胜,不仅在于其免费使用的特性,更在于其训练方式。与使用成千上万人工标注数据点来训练模型解决难题的传统方法不同,DeepSeek的R1-Zero和R1模型完全或主要通过试错进行训练,并未被明确告知如何得出解决方案,这很像人类完成拼图的过程。当答案正确时,模型会因其行为获得奖励,这也是计算机科学家称这种方法为强化学习的原因。

对于希望提升大型语言模型推理能力的研究人员来说,DeepSeek的成果令人鼓舞,尤其是如果它能达到与OpenAI模型相当的性能,而据报道其训练成本仅为后者的一小部分。另一个令人振奋的进展是:DeepSeek将其模型提供给非公司科学家进行检验,以验证结果是否足以在《自然》杂志上发表——这对AI公司来说实属罕见。或许最让研究人员兴奋的是,他们希望了解该模型的训练和输出能否让我们窥见AI模型“黑箱”内部的奥秘。

亚利桑那州立大学坦佩分校的计算机科学家苏巴拉奥·坎巴姆帕蒂表示,通过将模型置于同行评审过程中,“DeepSeek基本上亮出了底牌”,以便他人验证和改进算法。他参与了DeepSeek于9月17日发表在《自然》杂志上的论文的同行评审。尽管他认为现在对任何DeepSeek模型的内部机制下结论还为时过早,但“科学本应如此运作”。

为何强化学习训练成本更低

训练所需的算力越多,成本就越高。而教会大型语言模型分解并解决多步骤任务(如数学竞赛中的题目集)已被证明代价高昂,且成功率参差不齐。在传统训练中,科学家通常会告诉模型正确答案是什么以及达到该答案所需的步骤。这需要大量人工标注数据和强大的计算能力。

匹兹堡大学的强化学习研究员艾玛·乔丹指出,强化学习则不需要这些。研究人员并非监督大型语言模型的每一个步骤,而是只告诉它表现如何。

强化学习如何塑造DeepSeek模型

研究人员已利用强化学习训练大型语言模型生成有用的聊天机器人文本并避免有害回应,其奖励基于模型行为与期望行为的契合度。但乔丹表示,由于人类阅读偏好具有主观性,基于奖励的训练在此类应用上并不完美。相比之下,强化学习在应用于数学和代码问题时可能表现出色,因为这些问题有可验证的答案。

9月《自然》杂志的论文详细阐述了强化学习为何能适用于DeepSeek模型。在训练过程中,模型尝试用不同方法解决数学和代码问题,答案正确则获得奖励1,否则为0。期望是通过这种试错与奖励的过程,模型能学会解决问题所需的中间步骤,从而掌握推理模式。
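下面用一段极简的Python草图示意这种“只看最终答案”的可验证奖励:答对记1分,答错记0分,不检查任何中间步骤。这只是依上文描述写出的示意(函数名与“取最后一行作为答案”的做法均为本文假设),并非DeepSeek的实际实现。

    # 极简示意:基于最终答案的可验证奖励(正确为1,错误为0)
    # 注意:这里用字符串比对核对答案,是本文的简化假设;真实系统通常会做数学等价性判断

    def extract_final_answer(model_output: str) -> str:
        """取输出的最后一个非空行作为最终答案(示意性假设)。"""
        lines = [line.strip() for line in model_output.strip().splitlines() if line.strip()]
        return lines[-1] if lines else ""

    def outcome_reward(model_output: str, reference_answer: str) -> float:
        """最终答案与参考答案一致则奖励1.0,否则0.0,不考察中间步骤。"""
        return 1.0 if extract_final_answer(model_output) == reference_answer.strip() else 0.0

    print(outcome_reward("先化简方程……\n42", "42"))    # 1.0
    print(outcome_reward("推导过程有误……\n41", "42"))   # 0.0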

坎巴姆帕蒂解释说,在训练阶段,DeepSeek模型实际上并未完整解决问题。例如,模型会做出15次猜测。“如果这15次猜测中有任何一次正确,那么对于正确的那些,模型就会获得奖励,”他说,“而对于不正确的,则不会得到任何奖励。”

但这种奖励结构并不能保证问题一定被解决。“如果所有15次猜测都错了,那么你基本上得到零奖励。没有任何学习信号,”坎巴姆帕蒂指出。

要使这种奖励结构见效,DeepSeek必须有一个相当不错的“猜测者”作为起点。幸运的是,DeepSeek的基础模型V3 Base在推理问题上已经比OpenAI的GPT-4o等早期大型语言模型具有更高的准确率。实际上,这使模型更擅长猜测。坎巴姆帕蒂表示,如果基础模型已经足够好,以至于正确答案在其为某个问题生成的15个最可能答案之中,那么在学习过程中,其性能会得到提升,使正确答案成为其最可能的猜测。
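坎巴姆帕蒂所描述的“一组猜测、按对错打分”机制,可以用下面的小例子直观说明。这里假设用组内平均奖励作为基线来产生学习信号(这是一种常见的简化做法,未必与论文中的具体算法完全一致):

    # 示意:一组(如15个)猜测的奖励,以及“全部猜错则没有学习信号”
    from statistics import mean

    def learning_signals(rewards):
        """优势值 = 各猜测的奖励减去组内平均奖励;奖励全为0时,优势也全为0。"""
        baseline = mean(rewards)
        return [r - baseline for r in rewards]

    one_correct = [0.0] * 14 + [1.0]        # 15次猜测中恰有1次正确
    print(learning_signals(one_correct))     # 正确的那次为正,其余为负

    all_wrong = [0.0] * 15                   # 15次猜测全部错误
    print(learning_signals(all_wrong))       # 全为0:没有任何学习信号

这也直观解释了为何基础模型必须先是一个“不错的猜测者”:只有当组内偶尔出现正确答案时,才会产生非零的学习信号。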

但有一个注意事项:V3 Base可能擅长猜测,是因为DeepSeek研究人员从互联网上抓取了公开可用数据来训练它。研究人员在《自然》论文中写道,部分训练数据可能无意中包含了OpenAI或其他公司模型的输出。此外,他们以传统的监督方式训练了V3 Base,因此任何基于V3 Base开发的模型都可能包含这种反馈的成分,而不仅仅是强化学习。DeepSeek未回应《科学新闻》的置评请求。

在训练V3 Base以生成DeepSeek-R1-Zero时,研究人员使用了两种奖励类型:准确性和格式。对于数学问题,验证输出的准确性相对直接;奖励算法将大型语言模型的输出与正确答案核对并给予相应反馈。DeepSeek研究人员使用竞赛中的测试用例来评估代码。格式奖励则激励模型在提供最终解决方案前,描述其如何得出答案并标注该描述。
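按上文描述,这两类奖励各自可以写成一个函数。下面的Python草图仅作示意:其中<think>、<answer>这样的标签写法,是本文为说明“先标注思考过程、再给最终答案”而做的假设;数学题的核对也简化为字符串比对(代码题则应改为运行竞赛测试用例)。

    import re

    def accuracy_reward(output: str, reference: str) -> float:
        """准确性奖励:从<answer>标签中取出最终答案,与参考答案比对。"""
        match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
        if match is None:
            return 0.0
        return 1.0 if match.group(1).strip() == reference.strip() else 0.0

    def format_reward(output: str) -> float:
        """格式奖励:要求先给出被标注的思考过程,再给出最终答案。"""
        pattern = r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$"
        return 1.0 if re.match(pattern, output, re.DOTALL) else 0.0

    sample = "<think>设2x+1=85,则2x=84,x=42。</think><answer>42</answer>"
    print(accuracy_reward(sample, "42"), format_reward(sample))   # 1.0 1.0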

在基准数学和代码问题上,DeepSeek-R1-Zero的表现优于为基准研究选定的人类参与者,但该模型仍存在问题。例如,同时使用英文和中文数据进行训练,导致输出混杂两种语言,难以解读。因此,DeepSeek研究人员回头在训练流程中增加了额外的强化学习阶段,对语言一致性给予奖励,以防止混淆。于是,R1-Zero的继任者DeepSeek-R1诞生了。
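至于“语言一致性”奖励,论文只说明了其目的是避免中英文混杂。下面给出一个非常粗糙的示意(以中文为目标语言,用中文字符占比近似一致性;具体实现细节均为本文假设):

    def language_consistency_reward(text: str) -> float:
        """以中文字符在全部字母/文字字符中的占比,粗略近似“语言一致性”。"""
        letters = [ch for ch in text if ch.isalpha()]
        if not letters:
            return 0.0
        zh = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
        return zh / len(letters)

    print(language_consistency_reward("先化简方程,再求解。"))              # 1.0
    print(language_consistency_reward("先化简 the equation,再 solve。"))   # 约0.2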

大型语言模型现在能像人类一样推理吗?

表面上看,如果奖励引导模型得出正确答案,那么它在回应奖励时必然在进行推理决策。DeepSeek研究人员报告称,R1-Zero的输出表明它使用了推理策略。但坎巴姆帕蒂认为,我们并不真正理解模型内部的运作机制,其输出被过度拟人化,暗示它在“思考”。同时,探究AI模型“推理”的内部机制仍然是一个活跃的研究课题。

DeepSeek的格式奖励激励其模型响应采用特定结构。在模型生成最终答案前,它会以类似人类的语气生成“思考过程”,注明在何处检查中间步骤,这可能让用户认为其响应反映了其处理步骤。

AI模型如何“思考”

这段文本和方程式展示了DeepSeek模型输出格式的一个例子,概述了其在生成最终解决方案前的“思考过程”。
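原文此处配有一张示例图,本文未收录。下面给出一个形式上类似的虚构示例(题目与内容均为本文杜撰,标签写法沿用前文的示意),仅用于说明“先输出思考过程、再给出最终解答”的结构:

    <think>
    题目:解方程 2x + 1 = 85。
    两边同时减1,得 2x = 84。等一下,核对一下:85 - 1 = 84,无误。
    两边同时除以2,得 x = 42。
    </think>
    <answer>x = 42</answer>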

DeepSeek研究人员表示,随着训练的进行,模型的“思考过程”输出中,“顿悟时刻”和“等待”等词汇的出现频率更高,表明自我反思和推理行为的出现。此外,他们指出,模型为复杂问题生成更多的“思考标记”(模型处理问题时产生的字符、单词、数字或符号),为简单问题生成较少的思考标记,这表明它学会了为更难的问题分配更多的思考时间。

但坎巴姆帕蒂质疑,即使这些“思考标记”明显帮助了模型,它们是否真的向最终用户揭示了其处理步骤的任何实际信息。他认为这些标记并不对应于问题的某种逐步解决方案。在DeepSeek-R1-Zero的训练过程中,每一个促成正确答案的标记都会获得奖励,即使模型在通往正确答案的路上采取的一些中间步骤是无关紧要或走入死胡同的。他指出,这种基于结果的奖励模型并非设计为只奖励模型推理中有成效的部分以鼓励其更频繁地发生。“因此,仅基于结果奖励模型来训练系统,并自欺欺人地认为它学到了关于过程的东西,这是奇怪的。”

此外,众所周知,在诸如著名数学竞赛题目数据集等基准上测量的AI模型性能,并不足以充分表明模型解决问题的实际能力。“一般来说,判断一个系统是在真正通过推理来解决推理问题,还是利用记忆来解决推理问题,是不可能的,”坎巴姆帕蒂说。因此,他认为,一个包含固定问题集的静态基准无法准确传达模型的推理能力,因为模型可能在训练过程中,通过抓取互联网数据记住了正确答案。

坎巴姆帕蒂表示,AI研究人员似乎明白,当他们说大型语言模型在进行推理时,他们的意思是在推理基准测试中表现良好。但外行人可能会认为“如果模型得出了正确答案,那么它们一定遵循了正确的过程。”他说,“在基准测试中表现出色,与使用人类可能用来在该基准测试中表现出色的过程,是两件截然不同的事情。”对AI“推理”缺乏理解以及过度依赖此类AI模型可能存在风险,导致人类不加批判地接受AI的决策。

乔丹指出,一些研究人员正试图深入了解这些模型的工作原理,以及哪些训练程序真正将信息灌输到模型中,目的是降低风险。但是,截至目前,这些AI模型如何解决问题的内部机制仍然是一个悬而未决的问题。

英文来源:

A look under the hood of DeepSeek’s AI models doesn’t provide all the answers
It's still not obvious how the models work through multistep math and coding tasks
It’s been almost a year since DeepSeek made a major AI splash.
In January, the Chinese company reported that one of its large language models rivaled an OpenAI counterpart on math and coding benchmarks designed to evaluate multi-step problem solving capabilities, or what the AI field calls “reasoning.” DeepSeek’s buzziest claim was that it achieved this performance while keeping costs low. The implication: AI model improvements didn’t always need massive computing infrastructure or the very best computer chips but might be achieved by efficient use of cheaper hardware. A slew of research followed that headline-grabbing announcement, all trying to better understand DeepSeek models’ reasoning methods, improve them and even outperform them.
What makes the DeepSeek models intriguing is not only their price — free to use — but how they are trained. Instead of training the models to solve tough problems using thousands of human-labeled data points, DeepSeek’s R1-Zero and R1 models were trained exclusively or significantly through trial and error, without explicitly being told how to get to the solution, much like a human completing a puzzle. When an answer was correct, the model received a reward for its actions, which is why computer scientists call this method reinforcement learning.
To researchers looking to improve the reasoning abilities of large language models, or LLMs, DeepSeek’s results were inspiring, especially if it could perform as well as OpenAI’s models but be trained reportedly at a fraction of the cost. And there was another encouraging development: DeepSeek offered its models up to be interrogated by noncompany scientists to see if the results held true for publication in Nature — a rarity for an AI company. Perhaps what excited researchers most was to see if this model’s training and outputs could give us a look inside the “black box” of AI models.
In subjecting its models to the peer review process, “DeepSeek basically showed its hand,” so that others can verify and improve the algorithms, says Subbarao Kambhampati, a computer scientist at Arizona State University in Tempe who peer reviewed DeepSeek’s September 17 Nature paper. Although he says it’s premature to make conclusions about what’s going on under any DeepSeek model’s hood, “that’s how science is supposed to work.”
Why training with reinforcement learning costs less
The more computing power training takes, the more it costs. And teaching LLMs to break down and solve multistep tasks like problem sets from math competitions has proven expensive, with varying degrees of success. During training, scientists commonly would tell the model what a correct answer is and the steps it needs to take to reach that answer. That’s a lot of human-annotated data and a lot of computing power.
You don’t need that for reinforcement learning. Rather than supervise the LLM’s every move, researchers instead only tell the LLM how well it did, says reinforcement learning researcher Emma Jordan of the University of Pittsburgh.
How reinforcement learning shaped DeepSeek’s model
Researchers have already used reinforcement learning to train LLMs to generate helpful chatbot text and avoid toxic responses, where the reward is based on its alignment to the preferred behavior. But aligning with human reading preferences is an imperfect use case for reward-based training because of the subjective nature of that exercise, Jordan says. In contrast, reinforcement learning can shine when applied to math and code problems, which have a verifiable answer.
September’s Nature publication details what made it possible for reinforcement learning to work for DeepSeek’s models. During training, the models try different approaches to solve math and code problems, receiving a reward of 1 if correct or a zero otherwise. The hope is that, through the trial-and-reward process, the model will learn the intermediate steps, and therefore the reasoning patterns, required to solve the problem.
In the training phase, the DeepSeek model does not actually solve the problem to completion, Kambhampati says. Instead, the model makes, say, 15 guesses. “And if any of the 15 are correct, then basically for the ones that are correct, [the model] gets rewarded,” Kambhampati says. “And the ones that are not correct, it won’t get any reward.”
But this reward structure doesn’t guarantee that a problem will be solved. “If all 15 guesses are wrong, then you are basically getting zero reward. There is no learning signal whatsoever,” Kambhampati says.
For the reward structure to bear fruit, DeepSeek had to have a decent guesser as a starting point. Fortunately, DeepSeek’s foundation model, V3 Base, already had better accuracies than older LLMs such as OpenAI’s GPT-4o on the reasoning problems. In effect, that made the models better at guessing. If the base model is already good enough such that the correct answer is in the top 15 probable answers it comes up with for a problem, during the learning process, its performance improves so that the correct answer is its top-most probable guess, Kambhampati says.
There is a caveat: V3 Base might have been good at guessing because DeepSeek researchers scraped publicly available data from the internet to train it. The researchers write in the Nature paper that some of that training data could have included outputs from OpenAI’s or others’ models, however unintentionally. They also trained V3 Base in the traditional supervised manner, so therefore some component of that feedback, and not solely reinforcement learning, could go into any model emerging from V3 Base. DeepSeek did not respond to SN’s requests for comment.
When training V3 Base to produce DeepSeek-R1-Zero, researchers used two types of reward — accuracy and format. In the case of math problems, verifying the accuracy of an output is straightforward; the reward algorithm checks the LLM output against the correct answer and gives the appropriate feedback. DeepSeek researchers use test cases from competitions to evaluate code. Format rewards incentivize the model to describe how it arrived at an answer and to label that description before providing the final solution.
On the benchmark math and code problems, DeepSeek-R1-Zero performed better than the humans selected for the benchmark study, but the model still had issues. Being trained on both English and Chinese data, for example, led to outputs that mixed the languages, making the outputs hard to decipher. As a result, DeepSeek researchers went back and implemented an additional reinforcement learning stage in the training pipeline with a reward for language consistency to prevent the mix-up. Out came DeepSeek-R1, a successor to R1-Zero.
Can LLMs reason like humans now?
It might seem like if the reward gets the model to the right answer, it must be making reasoning decisions in its responses to rewards. And DeepSeek researchers report that R1-Zero’s outputs suggest that it uses reasoning strategies. But Kambhampati says that we don’t really understand how the models work internally and their outputs have been overly anthropomorphized to imply that they are thinking. Meanwhile, interrogating the inner workings of AI model “reasoning” remains an active research problem.
DeepSeek’s format reward incentivizes a specific structure for its model’s responses. Before the model produces the final answer, it generates its “thought process” in a humanlike tone, noting where it might check an intermediate step, which might make the user think that its responses mirror its processing steps.
How an AI model “thinks”
This string of text and equations shows an example of the DeepSeek model’s output format, outlining its “thinking process” before generating the final solution.
The DeepSeek researchers say that the model’s “thought process” output includes terms like ‘aha moment’ and ‘wait’ in higher frequency as the training progresses, indicating the emergence of self-reflective and reasoning behavior. Further, they say that the model generates more “thinking tokens” — characters, words, numbers or symbols produced as the model processes a problem — for complex problems and fewer for easy problems, suggesting that it learns to allocate more thinking time for harder problems.
But, Kambhampati wonders if the “thinking tokens,” even when clearly helping the model, provide any actual insight about its processing steps to the end user. He doesn’t think that the tokens correspond to some step-by-step solution of the problem. In DeepSeek-R1-Zero’s training process, every token that contributed to a correct answer gets rewarded, even if some intermediate steps the model took along the way to the correct answer were tangents or dead ends. This outcome-based reward model isn’t set up to reward only the productive portion of the model’s reasoning to encourage it to happen more often, he says. “So, it is strange to train the system only on the outcome reward model and delude yourself that it learned something about the process.”
Moreover, performance of AI models measured on benchmarks like a prestigious math competition’s dataset of problems is known to be an inadequate indicator of how good the model is at problem-solving. “In general, telling whether a system is actually doing reasoning to solve the reasoning problem or using memory to solve the reasoning problem is impossible,” Kambhampati says. So, a static benchmark, with a fixed set of problems, can’t accurately convey a model’s reasoning ability since the model could have memorized the correct answers during its training on scraped internet data, he says.
AI researchers seem to understand that when they say LLMs are reasoning, they mean that they’re doing well on the reasoning benchmarks, Kambhampati says. But laypeople might assume that “if the models got the correct answer, then they must be following the right process,” he says. “Doing well on a benchmark versus using the process that humans might be using to do well in that benchmark are two very different things.” A lack of understanding of AI’s “reasoning” and an overreliance on such AI models could be risky, leading humans to accept AI decisions without critically thinking about the answers.
Some researchers are trying to get insights into how these models work and what training procedures are actually instilling information into the model, Jordan says, with a goal to reduce risk. But, as of now, the inner workings of how these AI models solve problems remains an open question.
