
Teaching LLMs to Reason Like Bayesians

Published by qimuai · first-hand compilation



Source: https://research.google/blog/teaching-llms-to-reason-like-bayesians/

Summary:

A breakthrough from the Google Research team: "Bayesian teaching" enables large language models to reason probabilistically

On March 4, 2026, Google Research scientists Sjoerd van Steenkiste and Tal Linzen announced a new study. Using a supervised fine-tuning method called "Bayesian teaching", they trained large language models (LLMs) to mimic the predictions of an optimal Bayesian model, thereby equipping them with a core capability akin to Bayesian reasoning.

AI systems based on LLMs are increasingly acting as agents that interact with users and the world. To succeed at such tasks, a model must build internal representations of the world and estimate the probability that each representation is accurate. In personalized recommendation, for example, the model must gradually infer a user's preferences from their choices over multiple interactions. Bayesian inference defines the optimal way to perform such probabilistic updates. Without specific training, however, LLMs tend to fall back on simple heuristics (such as assuming every user always picks the cheapest option) rather than inferring an individual user's unique preferences.

To evaluate and improve LLMs' Bayesian reasoning, the team designed a simplified flight-recommendation experiment. Over five simulated rounds, an LLM assistant had to recommend flights matching a user's preferences based on the user's history of choices among three flight options (each defined by departure time, duration, number of stops, and cost). The team compared several mainstream LLMs against a "Bayesian assistant" that strictly follows the optimal Bayesian strategy. Off-the-shelf LLMs performed markedly worse than the Bayesian assistant. More importantly, while the Bayesian assistant kept improving its recommendations as it gathered more information about the user, LLM performance tended to plateau after the first round, revealing a limited ability to adapt to new information and keep updating probabilities.

To address this deficit, the team proposed the "Bayesian teaching" framework. Its core idea is to use supervised fine-tuning to teach the LLM to update "prior beliefs" into "posterior beliefs" in light of new evidence, as Bayesian inference prescribes. The team explored two strategies for generating fine-tuning data: "oracle teaching", in which the LLM learns from the interaction data of an "oracle assistant" that always knows the user's correct answer; and "Bayesian teaching", in which the LLM learns from interactions between users and the "Bayesian assistant", a model that makes the best possible guesses under uncertainty.

The experimental results were encouraging:

  1. Substantial performance gains: both fine-tuning strategies markedly improved the LLMs' performance on the recommendation task, with Bayesian teaching consistently outperforming oracle teaching.
  2. Approaching the mathematical ideal: LLMs fine-tuned with Bayesian teaching agreed with the Bayesian assistant's predictions up to 80% of the time. The models learned to weigh information realistically, giving greater weight to user choices that reveal preferences more clearly.
  3. Cross-domain generalization: most importantly, the learned reasoning skill is not task-specific. Models trained on synthetic flight data transferred their "probabilistic logic" to entirely different domains, such as hotel recommendations and real-world web shopping. This suggests LLMs can internalize the core principles of Bayesian inference, turning from static pattern matchers into adaptive agents capable of cross-domain reasoning.

Beyond exposing current LLMs' limitations in forming and updating probabilistic beliefs, the study's successful fine-tuning experiments demonstrate the potential of the post-training paradigm. By distilling a classic symbolic model (Bayesian inference) into a neural network, an LLM can learn to approximate an optimal probabilistic reasoning strategy and apply it in complex real-world domains that are hard to encode with explicit symbolic rules. This opens a path toward next-generation AI assistants with greater adaptability, stronger reasoning, and deeper user understanding.

Full translation:

Teaching LLMs to reason like Bayesians

March 4, 2026
Sjoerd van Steenkiste and Tal Linzen, Research Scientists, Google Research

We teach LLMs to reason in a Bayesian manner by training them to mimic the predictions of an optimal Bayesian model.


AI systems based on large language models (LLMs) are increasingly used as agents that interact with users and the world. To do this successfully, LLMs need to construct internal representations of the world and estimate the probability that each of these representations is accurate. Take personalized recommendations, for example: the LLM needs to gradually infer the user's preferences from their choices over the course of multiple interactions.

Bayesian inference defines the optimal way to perform such updates. By implementing this strategy, LLMs could optimize user interactions by updating their estimates of the user's preferences as new information about the user arrives. But without specific training, LLMs often default to simple heuristics (like assuming everyone wants the cheapest option) instead of inferring a specific user's unique preferences.

In "Bayesian teaching enables probabilistic reasoning in large language models", we teach LLMs to reason in a Bayesian manner by training them to mimic the predictions of a Bayesian model, which defines the optimal way to reason about probabilities. We find that this approach not only significantly improves the LLM's performance on the particular recommendation task on which it is trained, but also enables generalization to other tasks. This suggests that the method teaches the LLM to better approximate Bayesian reasoning. More generally, our results indicate that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.

Evaluating LLMs' Bayesian capabilities

As with humans, to interact effectively, an LLM needs to continually update its probabilistic estimates of the user's preferences based on each new interaction with them. Here we ask: do LLMs act as if they hold probabilistic estimates that are updated as optimal Bayesian inference would predict? And to the extent that an LLM's behavior deviates from the optimal Bayesian strategy, how can we minimize these deviations?

To test this, we used a simplified flight recommendation task, in which the LLM acts as an assistant interacting with a simulated user for five rounds. In each round, three flight options are presented to both the user and the assistant. Each flight is defined by a departure time, a duration, a number of stops, and a cost. Each simulated user is characterized by a set of preferences: for each feature, they may have a strong or weak preference for high or low values (for example, they may prefer longer or shorter flights), or no preference regarding that feature.

We compared the LLMs' behavior to that of a model following the optimal Bayesian strategy, the Bayesian assistant. This model maintains a probability distribution reflecting its estimates of the user's preferences, and uses Bayes' rule to update this distribution as new information about the user's choices becomes available. Unlike many real-life scenarios, where it is computationally difficult to specify and implement the Bayesian strategy, in this controlled setting it is easy to implement and allows us to precisely estimate the extent to which LLMs deviate from it.

The assistant's goal is to recommend the flight that matches the user's choice. At the end of each round, the user tells the assistant whether its recommendation was correct and provides the correct answer.

We evaluated a range of LLMs and found that they all performed significantly worse than the optimal Bayesian assistant. Most importantly, in contrast to the Bayesian assistant, which gradually improved its recommendations as it received more information about the user's choices, the LLMs' performance often plateaued after a single interaction, pointing to a limited ability to adapt to new information and showing little or no improvement over multiple interactions with the user.

We compared off-the-shelf LLMs from different model families to human participants and the Bayesian assistant. The LLMs performed considerably worse than the Bayesian assistant. Human participants showed greater improvement than most LLMs as they received more information, but they still fell short of the accuracy that characterizes the optimal Bayesian strategy.

The Bayesian teaching framework

In the Bayesian framework, an agent maintains a prior belief about the state of the world. For an LLM, this "world state" is its internal representation of facts, relationships, and concepts. As the model encounters new information (evidence), it needs to convert its prior belief (or "prior": the initial guess or probability for something before seeing new evidence) into a "posterior belief" (the updated probability after incorporating the new data), which then serves as the new prior for the next piece of evidence. This cyclical process allows the agent to continuously refine its understanding of the world.

The challenge is teaching the model how to perform these probabilistic updates. We did this through supervised fine-tuning, having the model update its parameters based on a large number of observed user interactions.

We explored two strategies for creating supervised fine-tuning data. In the first, which we call oracle teaching, we provided the LLM with interactions between simulated users and an "oracle" assistant that has perfect knowledge of the user's preferences and therefore always recommends the option matching the user's choice.

The second strategy, which we call Bayesian teaching, provided the LLM with interactions between the Bayesian assistant and the user. In this setting, the assistant often chose flights that did not match the user's preferred choice, especially in early rounds, when there was considerable uncertainty about the user's preferences. We hypothesized that mimicking the Bayesian assistant's best guesses would teach the LLM to maintain uncertainty and update its beliefs more effectively than oracle teaching, where the LLM is trained on the correct choices. This approach can be seen as a form of distillation, in which a model is trained by learning to mimic another system.

Results

Supervised fine-tuning teaches LLMs to approximate probabilistic inference. We examined accuracy after the first and final (fifth) rounds across different assistants, comparing the original LLMs, LLMs fine-tuned on user interactions with the Bayesian assistant, and LLMs fine-tuned on user interactions with the oracle assistant, which always provides the correct answer. Both types of fine-tuning significantly improved the LLMs' performance, and Bayesian teaching was consistently more effective than oracle teaching.

LLMs fine-tuned with Bayesian teaching agreed more with the Bayesian assistant and generalized beyond the task used for fine-tuning. We measured agreement between the LLMs and the Bayesian assistant as the proportion of trials in which the LLM made the same prediction as the Bayesian assistant. Fine-tuning on the Bayesian assistant's predictions made the LLMs more Bayesian, with the Bayesian version of each LLM achieving the highest agreement with the Bayesian assistant. We also examined final-round accuracy in the web shopping domain, which was unseen during fine-tuning. The green dashed line in the figure below indicates the performance of an LLM fine-tuned directly on web shopping data, so that no domain generalization was necessary, though such data may be harder to obtain.

Bayesian teaching significantly outperformed oracle teaching, enabling models to agree with the mathematical ideal 80% of the time. These fine-tuned models developed a realistic sensitivity to information, learning to weigh specific user choices more heavily when those choices revealed clearer preferences.

Crucially, these newly acquired skills were not task-specific. Models trained on synthetic flight data successfully transferred their "probabilistic logic" to entirely different domains, such as hotel recommendations and real-world web shopping. This suggests that LLMs can internalize the core principles of Bayesian inference, transforming from static pattern matchers into adaptive agents capable of cross-domain reasoning.

What's next for Bayesian teaching?

We tested a range of LLMs and found that they struggled to form and update probabilistic beliefs. We further found that continuing the LLMs' training through exposure to interactions between users and the Bayesian assistant (a model that implements the optimal probabilistic belief-update strategy) dramatically improved the LLMs' ability to approximate probabilistic reasoning.

While the findings of our first experiment point to the limitations of particular LLMs, the positive results of our subsequent fine-tuning experiments can be viewed more generally as a demonstration of the strength of the LLM "post-training" paradigm. By training the LLMs on demonstrations of the optimal strategy for the task, we were able to improve their performance considerably, suggesting that they learned to approximate the probabilistic reasoning strategy illustrated by the demonstrations. The LLMs were able to generalize this strategy to domains where it is difficult to encode explicitly in a symbolic model, demonstrating the power of distilling a classic symbolic model into a neural network.

English source:

Teaching LLMs to reason like Bayesians
March 4, 2026
Sjoerd van Steenkiste and Tal Linzen, Research Scientists, Google Research
We teach LLMs to reason in a Bayesian manner by training them to mimic the predictions of an optimal Bayesian model.
AI systems based on large language models (LLMs) are increasingly used as agents that interact with users and the world. To do this successfully, LLMs need to construct internal representations of the world and estimate the probability that each of these representations is accurate. Take personalized recommendations, for example: the LLM needs to gradually infer the user’s preferences from their choices over the course of multiple interactions.
Bayesian inference defines the optimal way to perform such updates. By implementing this strategy, LLMs could optimize user interactions by updating their estimates of the user’s preferences as new info about the user arrives. But without specific training, LLMs often default to simple heuristics — like assuming everyone wants the cheapest option — instead of inferring a specific user's unique preferences.
In “Bayesian teaching enables probabilistic reasoning in large language models”, we teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of the Bayesian model, which defines the optimal way to reason about probabilities. We find that this approach not only significantly improves the LLM’s performance on the particular recommendation task on which it is trained, but also enables generalization to other tasks. This suggests that this method teaches the LLM to better approximate Bayesian reasoning. More generally, our results indicate that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.
Evaluating LLMs’ Bayesian capabilities
As with humans, to be effective, an LLM’s user interactions require continual updates to its probabilistic estimates of the user’s preferences based on each new interaction with them. Here we ask: do LLMs act as if they have probabilistic estimates that are updated as expected from optimal Bayesian inference? To the extent that the LLM’s behavior deviates from the optimal Bayesian strategy, how can we minimize these deviations?
To test this, we used a simplified flight recommendation task, in which the LLMs interact as assistants with a simulated user for five rounds. In each round, three flight options were presented to both the user and the assistant. Each flight was defined by a departure time, a duration, a number of stops, and a cost. Each simulated user was characterized by a set of preferences: for each feature, they could have a strong or weak preference for high or low values of the feature (e.g., they may prefer longer or shorter flights), or no preference regarding this feature.
We compared the LLMs’ behavior to that of a model, a Bayesian assistant, that follows the optimal Bayesian strategy. This model maintains a probability distribution that reflects its estimates of the user’s preferences, and uses Bayes’ rule to update this distribution as new information about the user’s choices becomes available. Unlike many real-life scenarios, where it’s difficult to specify and implement the Bayesian strategy computationally, in this controlled setting it’s easy to implement and allows us to precisely estimate the extent to which LLMs deviate from it.
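The Bayesian assistant described above can be sketched in a few lines. The following is a minimal, illustrative implementation, not the paper's actual code: the feature names, the three-level weight grid (standing in for the paper's strong/weak/none preference levels), and the softmax choice model are all simplifying assumptions.

```python
import itertools
import math

# Hypothetical setup: each flight is scored on four features, and a user's
# preferences are summarized as one weight per feature (-1 = prefers low
# values, 0 = indifferent, +1 = prefers high values).
FEATURES = ["departure", "duration", "stops", "cost"]
WEIGHTS = [-1.0, 0.0, 1.0]

# One hypothesis = one weight vector; the assistant maintains a probability
# distribution over all of them.
HYPOTHESES = list(itertools.product(WEIGHTS, repeat=len(FEATURES)))

def utility(weights, flight):
    return sum(w * flight[f] for w, f in zip(weights, FEATURES))

def choice_likelihood(weights, options, chosen):
    # Softmax choice model: the user mostly picks the highest-utility option.
    scores = [math.exp(utility(weights, o)) for o in options]
    return scores[chosen] / sum(scores)

def update(posterior, options, chosen):
    # Bayes' rule: new posterior ∝ prior × likelihood of the observed choice.
    unnorm = [p * choice_likelihood(h, options, chosen)
              for h, p in zip(HYPOTHESES, posterior)]
    z = sum(unnorm)
    return [p / z for p in unnorm]

def recommend(posterior, options):
    # Recommend the option the user is most likely to choose, averaging the
    # choice probabilities over the current posterior.
    expected = [sum(p * choice_likelihood(h, options, i)
                    for h, p in zip(HYPOTHESES, posterior))
                for i in range(len(options))]
    return max(range(len(options)), key=expected.__getitem__)
```

Starting from a uniform prior over `HYPOTHESES`, each observed user choice sharpens the posterior, so the assistant's recommendations improve round over round, which is exactly the behavior the post attributes to the Bayesian assistant.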
The goal of the assistant was to recommend the flight that matches the user’s choice. At the end of each round, the user indicated to the assistant whether or not it chose correctly, and provided it with the correct answer.
We evaluated a range of LLMs and found that they all performed significantly worse than the optimal Bayesian Assistant. Most importantly, in contrast to the Bayesian Assistant, which gradually improved its recommendations as it received additional information about the user’s choices, LLMs’ performance often plateaued after a single interaction, pointing to a limited ability to adapt to new information and showing limited or no improvement over multiple interactions with the user.
We compared off-the-shelf LLMs from different model families to human participants and the Bayesian Assistant. The LLMs performed considerably worse than the Bayesian Assistant. Human participants demonstrated a greater improvement than most LLMs as they received more information, but they still fell short of the accuracy that characterizes the optimal Bayesian strategy.
The Bayesian teaching framework
In the Bayesian framework, an agent maintains a prior belief about the state of the world. For an LLM, this "world state" is its internal representation of facts, relationships, and concepts. As the model encounters new information (evidence), it needs to convert its prior belief (or “prior”, the initial guess or probability for something before seeing new evidence) into a “posterior belief” (the updated probability after incorporating new data) that serves as the new prior for the next piece of evidence. This cyclical process allows the agent to continuously refine its understanding of the world.
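Numerically, one cycle of this prior-to-posterior update looks as follows; the hypothesis names and likelihood values here are invented purely for illustration.

```python
# Two competing hypotheses about a user, with a uniform prior.
prior = {"prefers_short": 0.5, "prefers_long": 0.5}

# Assumed likelihood of the evidence "user picked the short flight"
# under each hypothesis (illustrative numbers).
likelihood = {"prefers_short": 0.8, "prefers_long": 0.3}

def bayes_update(prior, likelihood):
    # posterior(h) ∝ prior(h) × likelihood(evidence | h)
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# The posterior after one observation becomes the prior for the next.
posterior = bayes_update(prior, likelihood)
posterior = bayes_update(posterior, likelihood)
```

After two consistent observations, belief in `prefers_short` rises from 0.5 to about 0.88, which is the cyclical refinement the paragraph describes.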
The challenge is teaching the model how to perform these probabilistic updates. We did this through supervised fine-tuning, where we had the model update its parameters based on a large number of interactions it observed with users.
We explored two strategies to create supervised fine-tuning data. In the first strategy, which we refer to as Oracle teaching, we provided the LLM with interactions between simulated users and an “oracle” assistant that has perfect knowledge of the user’s preferences, and as such always recommends the option that is identical to the user’s choices.
The second strategy, which we call Bayesian teaching, provided the LLM with interactions between the Bayesian Assistant and the user. In this setting, the assistant often chose flights that did not match the user’s preferred choice, especially in early rounds where there was considerable uncertainty about the user’s preferences. We hypothesized that mimicking the Bayesian Assistant’s best guesses would teach the LLM to maintain uncertainty and update its beliefs more effectively than Oracle teaching, where the LLM is trained on the correct choices. This approach can be seen as a form of distillation, where a model is trained by learning to mimic another system.
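One plausible way to assemble such fine-tuning pairs from teacher traces is sketched below. The prompt format and helper names (`format_prompt`, `make_sft_examples`) are hypothetical, not taken from the paper; the only difference between the two strategies is whose pick supplies the target at each round (the oracle's, which equals the user's true choice, or the Bayesian assistant's best guess).

```python
def format_prompt(history, options):
    # Serialize the dialogue so far plus the current options into the
    # model input (a hypothetical, simplified prompt format).
    lines = [f"Round {i + 1}: options={opts}, user chose {choice}"
             for i, (opts, choice) in enumerate(history)]
    lines.append(f"Current options: {options}. Which flight do you recommend?")
    return "\n".join(lines)

def make_sft_examples(episode, teacher_picks):
    # episode: list of (options, user_choice) per round.
    # teacher_picks: the teacher's recommendation at each round
    # (oracle teaching: the user's true choice; Bayesian teaching:
    # the Bayesian assistant's best guess at that round).
    examples, history = [], []
    for (options, user_choice), pick in zip(episode, teacher_picks):
        examples.append({"input": format_prompt(history, options),
                         "target": f"Flight {pick}"})
        history.append((options, user_choice))
    return examples
```

Note that under Bayesian teaching the target can disagree with the user's eventual choice, especially in early rounds; that is precisely the signal hypothesized to teach the model to act under uncertainty.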
Results
Supervised fine-tuning teaches LLMs to approximate probabilistic inference. We examined the accuracy after the first round and final (fifth) round across different assistants. We compared the original LLMs, LLMs fine-tuned on user interactions with the Bayesian Assistant, and LLMs fine-tuned on user interactions with an oracle, which always provided the correct answer. Both types of fine-tuning significantly improved LLMs’ performance, and Bayesian teaching was consistently more effective than oracle teaching.
Fine-tuned LLMs using Bayesian teaching agreed more with the Bayesian Assistant, and generalized outside the task used for fine-tuning. We showed agreement between the LLMs and the Bayesian Assistant, measured by the proportion of trials where the LLMs made the same predictions as the Bayesian Assistant. Fine-tuning on the Bayesian Assistant’s predictions made the LLMs more Bayesian, with the Bayesian versions of each LLM achieving the highest agreement with the Bayesian Assistant. We also looked at the final-round accuracy for LLMs on the web shopping domain, which was unseen during fine-tuning. The green dashed line in the figure below indicates the performance of the LLM when it was fine-tuned directly on web shopping data, such that no domain generalization was necessary, but which might be more difficult to obtain.
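The agreement measure used here is simply the fraction of trials with matching predictions; a minimal sketch:

```python
def agreement(llm_preds, bayes_preds):
    # Proportion of trials where the LLM's recommendation matches the
    # Bayesian assistant's prediction on the same trial.
    assert len(llm_preds) == len(bayes_preds)
    matches = sum(a == b for a, b in zip(llm_preds, bayes_preds))
    return matches / len(llm_preds)
```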
Bayesian teaching significantly outperformed Oracle teaching, enabling models to agree with mathematical ideals 80% of the time. These fine-tuned models developed a realistic sensitivity to information, learning to weigh specific user choices more heavily when those choices revealed clearer preferences.
Crucially, these newly acquired skills were not task-specific. Models trained on synthetic flight data successfully transferred their "probabilistic logic" to entirely different domains, such as hotel recommendations and real-world web shopping. This suggests that LLMs can internalize the core principles of Bayesian inference, transforming from static pattern-matchers into adaptive agents capable of cross-domain reasoning.
What’s next for Bayesian teaching?
We tested a range of LLMs and found that they struggled to form and update probabilistic beliefs. We further found that continuing the LLMs’ training through exposure to interactions between users and the Bayesian Assistant — a model that implements the optimal probabilistic belief update strategy — dramatically improved the LLMs’ ability to approximate probabilistic reasoning.
While our findings from our first experiment point to the limitations of particular LLMs, the positive findings of our subsequent fine-tuning experiments can be viewed as a demonstration of the strength of the LLM “post-training” paradigm more generally. By training the LLMs on demonstrations of the optimal strategy to perform the task, we were able to improve their performance considerably, suggesting that they learned to approximate the probabilistic reasoning strategy illustrated by the demonstrations. The LLMs were able to generalize this strategy to domains where it is difficult to encode it explicitly in a symbolic model, demonstrating the power of distilling a classic symbolic model into a neural network.
