引入嵌套学习:一种持续学习的新机器学习范式

内容来源:https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/
内容总结:
谷歌研究团队于2025年11月7日发布名为"嵌套学习"(Nested Learning)的全新机器学习范式,该成果由谷歌学生研究员Ali Behrouz与副总裁兼Google Fellow Vahab Mirrokni共同提出,相关论文已发表于NeurIPS 2025(神经信息处理系统大会)。
这项研究旨在突破当前大语言模型在持续学习过程中的核心瓶颈——"灾难性遗忘"现象,即模型在学习新任务时丢失原有技能的问题。研究团队创新性地将单一机器学习模型重构为多层级嵌套优化系统,通过揭示模型架构与训练算法本质上的统一性,开辟了人工智能设计新维度。
受人类大脑神经可塑性机制启发,研究团队开发出具有自修改能力的"Hope"架构作为概念验证。该架构采用连续记忆系统设计,将记忆模块按更新频率构建为连续谱系,在语言建模、长上下文推理等测试中表现出优于当前主流模型的性能,尤其在长文本信息检索任务中展现出卓越的记忆管理能力。
实验结果表明,基于嵌套学习理念的深度优化器能有效提升模型对不完善数据的适应能力,而连续记忆系统的引入则为实现更接近人脑的持续学习能力提供了可行路径。这项突破性研究为开发具备自主进化能力的新一代人工智能奠定了理论基础。
中文翻译:
嵌套学习:一种持续学习的新机器学习范式
2025年11月7日
Ali Behrouz(学生研究员)与 Vahab Mirrokni(副总裁兼Google Fellow,谷歌研究院)
我们在此介绍"嵌套学习"——一种新的机器学习方法。它将模型视为一系列更小、相互嵌套的优化问题,每个问题都有其内部工作流程,旨在减轻甚至完全避免"灾难性遗忘"问题。所谓灾难性遗忘,即模型在学习新任务时,牺牲了对旧任务的熟练程度。
过去十年,机器学习取得了惊人进展,这主要得益于强大的神经网络架构及其训练算法。然而,尽管大语言模型取得了成功,一些根本性挑战依然存在,尤其是在持续学习方面——即模型随时间推移主动获取新知识技能而不遗忘旧有能力。
在持续学习和自我改进方面,人脑是黄金标准。它通过神经可塑性来适应——这是一种根据新体验、记忆和学习改变自身结构的卓越能力。缺乏这种能力,个体将局限于即时情境(如顺行性遗忘)。我们在当前的大语言模型中也看到了类似的局限性:它们的知识要么局限于输入窗口的即时上下文,要么局限于预训练期间学到的静态信息。
简单地用新数据持续更新模型参数的方法,常常导致"灾难性遗忘":学习新任务会牺牲对旧任务的熟练度。研究人员传统上通过调整架构或改进优化规则来对抗灾难性遗忘。然而,长久以来,我们一直将模型架构(网络结构)和优化算法(训练规则)视为两个独立的事物,这阻碍了我们实现真正统一、高效的学习系统。
在我们发表于NeurIPS 2025的论文《嵌套学习:深度学习架构的错觉》中,我们引入了"嵌套学习"以弥合这一鸿沟。嵌套学习不将单个机器学习模型视为一个连续过程,而是将其视为一个相互连接、多层次的学习问题系统,这些问题被同时优化。我们认为,模型架构和用于训练它的规则(即优化算法)本质上是同一概念;它们只是优化的不同"层级",每一层都有其内部的信息流("上下文流")和更新速率。通过认识到这种内在结构,嵌套学习为设计更强大的人工智能提供了一个新的、此前不可见的维度,使我们能够构建具有更深计算深度的学习组件,这最终有助于解决诸如灾难性遗忘等问题。
我们通过一个概念验证性的、自修改的架构(我们称之为"Hope")来测试和验证嵌套学习。该架构在语言建模任务中实现了卓越性能,并展现出比现有最先进模型更优的长上下文记忆管理能力。
嵌套学习范式
嵌套学习揭示出,一个复杂的机器学习模型实际上是一组连贯的、相互嵌套或并行运行的相互连接的优化问题。这些内部问题中的每一个都有其自身的上下文流——即它试图从中学习的一组独特信息。
这种视角意味着,现有的深度学习方法本质上是通过压缩其内部上下文流来工作的。更重要的是,嵌套学习揭示了一个新的模型设计维度,使我们能够构建具有更深计算深度的学习组件。
为了阐释这一范式,我们来看联想记忆的概念——即根据一件事物映射和回忆起另一件事物的能力(例如,看到一张脸时回忆起一个名字)。
- 我们证明,训练过程本身,特别是反向传播过程,可以被建模为一种联想记忆。模型学习将给定的数据点映射到其局部误差值,该误差值衡量了该数据点的"意外"或出乎意料程度。
- 同样,遵循先前的研究(例如Miras),关键的架构组件,如Transformer中的注意力机制,也可以被形式化为简单的联想记忆模块,学习序列中标记之间的映射关系。
通过定义更新频率(即每个组件的权重被调整的频率),我们可以将这些相互连接的优化问题排序为不同的"层级"。这个有序的集合构成了嵌套学习范式的核心。
应用嵌套学习
嵌套学习的视角立即为我们提供了改进现有算法和架构的原则性方法:
深度优化器
由于嵌套学习将优化器(例如基于动量的优化器)视为联想记忆模块,它使我们能够把联想记忆视角下的原则应用到优化器本身。我们观察到,许多标准优化器依赖于简单的点积相似度(通过计算两个向量对应分量乘积之和来衡量它们的相似程度),其更新并未考虑不同数据样本之间的相互关系。通过将优化器的底层目标改为更标准的损失度量,例如L2回归损失(回归任务中常见的损失函数,用预测值与真实值之差的平方和来量化误差),我们为动量等核心概念推导出了新的公式,使它们对不完美的数据更具韧性。
连续统记忆系统
在标准Transformer中,序列模型充当短期记忆,保存即时上下文,而前馈神经网络则充当长期记忆,存储预训练知识。嵌套学习范式将这一概念扩展为我们所谓的"连续统记忆系统"(CMS),其中记忆被视为一系列模块构成的谱系,每个模块以不同且特定的频率进行更新。这为持续学习创建了一个更丰富、更有效的记忆系统。
Hope:具备连续统记忆的自修改架构
作为概念验证,我们运用嵌套学习原理设计了Hope,它是Titans架构的一个变体。Titans架构是一类长期记忆模块,根据记忆的"意外"程度来确定其优先级。尽管其记忆管理能力强大,但它们只有两个层级的参数更新,因而仅具备一阶的上下文内学习能力。Hope则是一种自修改的循环架构,能够利用无限层级的上下文内学习,并通过集成CMS模块扩展到更大的上下文窗口。它本质上可以通过自引用过程优化自身记忆,从而形成一个具有无限循环学习层级的架构。
实验
我们进行了一系列实验,以评估我们深度优化器的有效性以及Hope在语言建模、长上下文推理、持续学习和知识融合任务上的性能。完整结果请参阅我们的论文。
结果
我们的实验证实了嵌套学习、连续统记忆系统设计以及自修改Titans架构的强大能力。
在各种常用且公开的语言建模和常识推理任务上,与现代循环模型和标准Transformer相比,Hope架构展现出更低的困惑度和更高的准确率。
Hope在长上下文"大海捞针"下游任务中展示了卓越的记忆管理能力,证明了连续统记忆系统提供了一种更高效、更有效的方式来处理扩展的信息序列。
结论
嵌套学习范式代表着我们在理解深度学习方面向前迈进了一步。通过将架构和优化视为一个单一、连贯的嵌套优化问题系统,我们解锁了一个可以堆叠多个学习层级的全新设计维度。由此产生的模型(如Hope架构)表明,以原则性方法统一这些元素,能够催生出更具表现力、更强大且更高效的学习算法。
我们相信,嵌套学习范式为弥合当前大语言模型有限的、易遗忘的特性与人脑卓越的持续学习能力之间的差距,奠定了坚实的基础。我们期待研究界探索这一新维度,并帮助我们构建下一代自我改进的人工智能。
致谢
本研究由Ali Behrouz、Meisam Razaviyayn、Peilin Zhong和Vahab Mirrokni完成。我们感谢Praneeth Kacham和Corinna Cortes审阅本工作并提出宝贵建议。我们也感谢Yuan Deng和Zeman Li。最后,我们感谢Mark Simborg和Kimberly Schwede在撰写这篇博客文章过程中提供的帮助。
英文来源:
Introducing Nested Learning: A new ML paradigm for continual learning
November 7, 2025
Ali Behrouz, Student Researcher, and Vahab Mirrokni, VP and Google Fellow, Google Research
We introduce Nested Learning, a new approach to machine learning that views models as a set of smaller, nested optimization problems, each with its own internal workflow, in order to mitigate or even completely avoid the issue of “catastrophic forgetting”, where learning new tasks sacrifices proficiency on old tasks.
The last decade has seen incredible progress in machine learning (ML), primarily driven by powerful neural network architectures and the algorithms used to train them. However, despite the success of large language models (LLMs), a few fundamental challenges persist, especially around continual learning, the ability for a model to actively acquire new knowledge and skills over time without forgetting old ones.
When it comes to continual learning and self-improvement, the human brain is the gold standard. It adapts through neuroplasticity — the remarkable capacity to change its structure in response to new experiences, memories, and learning. Without this ability, a person is limited to immediate context (like anterograde amnesia). We see a similar limitation in current LLMs: their knowledge is confined to either the immediate context of their input window or the static information that they learn during pre-training.
The simple approach, continually updating a model's parameters with new data, often leads to “catastrophic forgetting” (CF), where learning new tasks sacrifices proficiency on old tasks. Researchers traditionally combat CF through architectural tweaks or better optimization rules. However, for too long, we have treated the model's architecture (the network structure) and the optimization algorithm (the training rule) as two separate things, which prevents us from achieving a truly unified, efficient learning system.
In our paper, “Nested Learning: The Illusion of Deep Learning Architectures”, published at NeurIPS 2025, we introduce Nested Learning, which bridges this gap. Nested Learning treats a single ML model not as one continuous process, but as a system of interconnected, multi-level learning problems that are optimized simultaneously. We argue that the model's architecture and the rules used to train it (i.e., the optimization algorithm) are fundamentally the same concepts; they are just different "levels" of optimization, each with its own internal flow of information ("context flow") and update rate. By recognizing this inherent structure, Nested Learning provides a new, previously invisible dimension for designing more capable AI, allowing us to build learning components with deeper computational depth, which ultimately helps solve issues like catastrophic forgetting.
We test and validate Nested Learning through a proof-of-concept, self-modifying architecture that we call “Hope”, which achieves superior performance in language modeling and demonstrates better long-context memory management than existing state-of-the-art models.
The Nested Learning paradigm
Nested Learning reveals that a complex ML model is actually a set of coherent, interconnected optimization problems nested within each other or running in parallel. Each of these internal problems has its own context flow — its own distinct set of information from which it is trying to learn.
This perspective implies that existing deep learning methods work by essentially compressing their internal context flows. More importantly, Nested Learning reveals a new dimension for designing models, allowing us to build learning components with deeper computational depth.
To illustrate this paradigm, we look at the concept of associative memory — the ability to map and recall one thing based on another (like recalling a name when you see a face).
- We show that the training process itself, specifically the backpropagation process, can be modeled as an associative memory. The model learns to map a given data point to the value of its local error, which serves as a measure of how "surprising" or unexpected that data point was.
- Similarly, following previous studies (e.g., Miras), key architectural components, such as the attention mechanism in transformers, can also be formalized as simple associative memory modules that learn the mapping between tokens in a sequence.
By defining an update frequency rate, i.e., how often each component's weights are adjusted, we can order these interconnected optimization problems into "levels." This ordered set forms the heart of the Nested Learning paradigm.
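To make the notion of levels ordered by update frequency concrete, here is a minimal NumPy sketch of two linear associative memories that compress their own context flows at different rates; the dimensions, learning rates, and the delta-rule write are our own illustrative choices, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # key/value dimensionality
fast_M = np.zeros((d, d))  # level 1: written at every token (fast context flow)
slow_M = np.zeros((d, d))  # level 2: written every `period` tokens (slow context flow)
period = 4                 # the slow level's update frequency
lr_fast, lr_slow = 0.5, 0.1
buffer = []                # context accumulated for the slow level

for t in range(1, 33):
    k = rng.normal(size=d)   # key, e.g. a token representation
    v = rng.normal(size=d)   # value to associate with that key

    # Level 1: delta-rule step toward the association k -> v at every token.
    fast_M += lr_fast * np.outer(v - fast_M @ k, k)

    # Level 2: updated only every `period` tokens, on its own accumulated context.
    buffer.append((k, v))
    if t % period == 0:
        for bk, bv in buffer:
            slow_M += lr_slow * np.outer(bv - slow_M @ bk, bk)
        buffer.clear()

print("fast-level recall error on the last pair:", np.linalg.norm(fast_M @ k - v))
```

The fast level reacts to every token while the slow level only consolidates its accumulated context every few steps; it is this ordering by update frequency that Nested Learning treats as a design dimension in its own right.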
Putting Nested Learning to work
The Nested Learning perspective immediately gives us principled ways to improve existing algorithms and architectures:
Deep optimizers
Since Nested Learning views optimizers (e.g., momentum-based optimizers) as associative memory modules, it allows us to apply principles from the associative-memory perspective to them. We observed that many standard optimizers rely on simple dot-product similarity (a measure of how alike two vectors are, calculated as the sum of the products of their corresponding components) whose update doesn't account for how different data samples relate to each other. By changing the underlying objective of the optimizer to a more standard loss metric, such as the L2 regression loss (a common loss function in regression tasks that quantifies error as the sum of squared differences between predicted and true values), we derive new formulations for core concepts like momentum, making them more resilient to imperfect data.
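To illustrate what swapping the optimizer's internal objective can look like, the toy sketch below contrasts standard heavy-ball momentum (which, read as a memory, simply accumulates gradients Hebbian-style) with a matrix-valued state updated by one gradient step on an L2 regression loss. The delta-rule form, the `key` feature, and all hyperparameters are our own assumptions rather than the formulation derived in the paper.

```python
import numpy as np

def hebbian_momentum(m, grad, beta=0.9):
    # Standard heavy-ball momentum: a decayed sum of raw gradients. Under the
    # associative-memory reading this is a dot-product-style (Hebbian) write:
    # every new gradient is added regardless of what the state already predicts.
    return beta * m + grad

def regression_momentum(M, key, grad, beta=0.9, lr=0.5):
    # Toy "deep" variant (our illustrative assumption, not the paper's exact rule):
    # the state is a matrix memory that, after decay, takes one gradient step on
    # the L2 regression loss ||M @ key - grad||^2, i.e. a delta-rule write.
    # Samples the memory already predicts well contribute only small updates.
    return beta * M + lr * np.outer(grad - M @ key, key)

rng = np.random.default_rng(1)
d = 4
m, M = np.zeros(d), np.zeros((d, d))
for _ in range(10):
    key = rng.normal(size=d)    # e.g. a feature describing the current sample
    grad = rng.normal(size=d)   # the gradient that sample produces
    m = hebbian_momentum(m, grad)
    M = regression_momentum(M, key, grad)

print("plain momentum state:", m.round(2))
print("regression memory recalls the last gradient:", (M @ key).round(2), "vs", grad.round(2))
```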
Continuum memory systems
In a standard Transformer, the sequence model acts as a short-term memory, holding the immediate context, while the feedforward neural networks act as long-term memory, storing pre-training knowledge. The Nested Learning paradigm extends this concept into what we call a “continuum memory system” (CMS), where memory is seen as a spectrum of modules, each updating at a different, specific frequency rate. This creates a much richer and more effective memory system for continual learning.
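A continuum memory system can be pictured as a chain of memory blocks that differ only in how often their weights are written. The sketch below is our own minimal illustration of that spectrum; the `LinearMemory` class, the delta-rule write, and the periods 1/4/16/64 are invented for exposition and are not taken from the paper.

```python
import numpy as np

class LinearMemory:
    """A toy memory block: a linear associative map written with a delta rule."""
    def __init__(self, dim, period, lr=0.2):
        self.M = np.zeros((dim, dim))
        self.period = period   # how often (in steps) this block's weights are written
        self.lr = lr

    def read(self, key):
        return self.M @ key

    def maybe_write(self, step, key, value):
        if step % self.period == 0:
            self.M += self.lr * np.outer(value - self.M @ key, key)

# A spectrum of update frequencies: fast blocks play the role of short-term memory,
# slow blocks consolidate longer-lived knowledge.
dim = 8
cms = [LinearMemory(dim, period) for period in (1, 4, 16, 64)]

rng = np.random.default_rng(2)
for step in range(1, 129):
    key, value = rng.normal(size=dim), rng.normal(size=dim)
    reading = sum(block.read(key) for block in cms)   # read from the whole spectrum
    for block in cms:
        block.maybe_write(step, key, value)

print("norm of the combined reading at the last step:", round(float(np.linalg.norm(reading)), 3))
```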
Hope: A self-modifying architecture with continuum memory
As a proof-of-concept, we used Nested Learning principles to design Hope, a variant of the Titans architecture. Titans architectures are long-term memory modules that prioritize memories based on how surprising they are. Despite their powerful memory management, they have only two levels of parameter updates, resulting in first-order in-context learning. Hope, however, is a self-modifying recurrent architecture that can take advantage of unbounded levels of in-context learning and is also augmented with CMS blocks to scale to larger context windows. It can essentially optimize its own memory through a self-referential process, creating an architecture with infinite, looped learning levels.
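The self-referential flavor of this design can be caricatured as a memory whose own update rule is itself adjusted at a slower level. The skeleton below is a loose structural sketch under our own assumptions; the actual Hope model is a self-modifying recurrent sequence model with CMS blocks, and the class name, the surprise heuristic, and the constants here do not come from the paper.

```python
import numpy as np

class SelfModifyingMemory:
    """Toy two-level self-reference: level 1 stores associations, while level 2
    slowly rewrites the rule (here just the learning rate) that level 1 uses
    to update itself."""
    def __init__(self, dim, meta_period=8, meta_lr=0.05):
        self.M = np.zeros((dim, dim))   # level 1: fast associative memory
        self.lr = 0.3                   # the parameter of level 1's own update rule
        self.meta_period = meta_period  # level 2 is written this many steps apart
        self.meta_lr = meta_lr
        self.surprises = []

    def step(self, t, key, value):
        surprise = value - self.M @ key                 # prediction error ("surprise")
        self.M += self.lr * np.outer(surprise, key)     # level 1: delta-rule write
        self.surprises.append(float(surprise @ surprise))
        # Level 2: every meta_period steps, modify the update rule itself, growing
        # the learning rate while recent inputs stay surprising, shrinking it otherwise.
        if t % self.meta_period == 0:
            recent = np.mean(self.surprises[-self.meta_period:])
            self.lr *= (1 + self.meta_lr) if recent > 1.0 else (1 - self.meta_lr)

rng = np.random.default_rng(3)
mem = SelfModifyingMemory(dim=8)
for t in range(1, 65):
    mem.step(t, rng.normal(size=8), rng.normal(size=8))
print("learning rate after self-modification:", round(mem.lr, 3))
```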
Experiments
We conducted experiments to evaluate the effectiveness of our deep optimizers and the performance of Hope on language modeling, long-context reasoning, continual learning, and knowledge incorporation tasks. The full results are available in our paper.
Results
Our experiments confirm the power of Nested Learning, the design of continuum memory systems, and self-modifying Titans.
On a diverse set of commonly used and public language modeling and common-sense reasoning tasks, the Hope architecture demonstrates lower perplexity and higher accuracy compared to modern recurrent models and standard transformers.
Hope showcases superior memory management in long-context Needle-In-Haystack (NIAH) downstream tasks, proving that the CMSs offer a more efficient and effective way to handle extended sequences of information.
Conclusion
The Nested Learning paradigm represents a step forward in our understanding of deep learning. By treating architecture and optimization as a single, coherent system of nested optimization problems, we unlock a new dimension for design, stacking multiple levels. The resulting models, like the Hope architecture, show that a principled approach to unifying these elements can lead to more expressive, capable, and efficient learning algorithms.
We believe the Nested Learning paradigm offers a robust foundation for closing the gap between the limited, forgetting nature of current LLMs and the remarkable continual learning abilities of the human brain. We are excited for the research community to explore this new dimension and help us build the next generation of self-improving AI.
Acknowledgements
This research was conducted by Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. We thank Praneeth Kacham and Corinna Cortes for reviewing the work and their valuable suggestions. We also thank Yuan Deng and Zeman Li. Finally, we thank Mark Simborg and Kimberly Schwede for their help in crafting this blog post.