
该人工智能模型能直观理解物理世界的运作原理。

内容来源:https://www.wired.com/story/how-one-ai-model-creates-a-physical-intuition-of-its-environment/

内容总结:

【科技前沿】AI模型学会“直觉”:像婴儿一样理解物体恒存性

近期,一项人工智能研究取得突破性进展:由Meta公司开发的视频联合嵌入预测架构(V-JEPA)模型,通过观看海量视频自主学习,展现出类似人类婴儿的物理直觉能力。该模型不仅能理解物体恒存性、重力作用等基础物理规律,还会对违背常识的画面产生“惊讶”反应。

传统AI视频理解系统通常依赖像素级分析,容易受无关细节干扰。而V-JEPA采用“潜在表征”抽象化处理机制——如同将数千像素信息浓缩为描述物体高度、位置等核心特征的几个数字,使模型能聚焦关键信息,忽略枝叶晃动等无关变化。这种设计让AI能够更高效地把握视频中真正重要的内容。

在专门测试物理常识的IntPhys基准评估中,V-JEPA对视频中物理合理性的判断准确率接近98%,而基于像素空间预测的知名模型仅略高于随机水平。研究团队还量化了模型的“惊讶度”:当小球滚入遮挡物后未按预期重现时,系统会产生显著预测误差,这种反应类似于婴幼儿的直觉响应。

阿姆斯特丹大学认知科学家米夏·海尔布隆评价称,这项研究有力地表明,AI无需预设大量先天知识,仅通过观察就能学会这类基础物理直觉。不过伦敦大学学院计算神经科学家卡尔·弗里斯顿指出,当前模型尚缺乏对不确定性的恰当编码与量化。

今年6月,Meta团队发布了拥有12亿参数的V-JEPA 2模型,并将其应用于机器人操控任务。新模型仅需约60小时机器人数据微调预测器网络,即可用于规划简单的操控动作。但在更困难的直觉物理基准测试中,其表现仅略优于随机猜测——研究人员坦言,模型现有记忆时长仅数秒,“堪比金鱼的记忆”。

这项突破标志着AI在理解现实世界规律方面迈出关键一步,为开发更智能的自主系统奠定基础。随着技术迭代,未来或能创造出具备更持久记忆与推理能力的人工智能。

中文翻译:

本文原载于《量子杂志》。

有一个针对婴儿的测试:在桌上放一杯水,用木板遮住,然后移动木板靠近杯子。如果木板直接穿过杯子,仿佛杯子不存在,婴儿会惊讶吗?许多6个月大的婴儿会表现出惊讶;而到一岁左右,几乎所有儿童都能通过观察形成对物体恒存性的直觉认知。如今,某些人工智能模型也具备了这种能力。

研究人员开发出一套通过视频学习世界知识的人工智能系统,当接收的信息与其已掌握的知识相悖时,该系统会表现出"惊讶"反应。

这个由Meta公司创建的模型名为"视频联合嵌入预测架构"(V-JEPA)。它并不预设视频中蕴含的物理规律,却能逐渐理解世界的运行方式。

"他们的主张从理论上看非常可信,实验结果也极具启发性,"阿姆斯特丹大学研究大脑与人工系统如何理解世界的认知科学家米夏·海尔布隆评价道。

更高层次的抽象

自动驾驶汽车的工程师们深知,让AI系统可靠地理解所见内容十分困难。大多数旨在"理解"视频的系统——无论是用于分类内容(例如"打网球的人")还是识别物体轮廓(如前方车辆)——都工作在所谓"像素空间"中。这类模型本质上将视频中的每个像素视为同等重要。

但像素空间模型存在局限。试想理解一条郊区街道的场景:若画面中出现汽车、交通信号灯和树木,模型可能过度关注树叶晃动等无关细节,却忽略信号灯颜色或周边车辆位置。"处理图像或视频时,不应在像素空间操作,因为存在太多无需建模的细节,"布朗大学计算机科学家兰德尔·巴莱斯特里罗指出。

2024年发布的V-JEPA架构正是为规避这些问题而设计。尽管构成V-JEPA的各类人工神经网络具体结构复杂,但其核心理念却很简单。

传统像素空间系统通过遮蔽视频帧中的部分像素,并训练神经网络预测这些被遮蔽像素值来完成训练。V-JEPA同样会遮蔽视频帧区域,但它不在单个像素层面预测遮蔽内容,而是运用更高层次的抽象表征——即"潜在表征"——来建模内容。

潜在表征仅捕捉数据的核心特征。例如,面对各种圆柱体的线稿图,一种称为编码器的神经网络可将每幅图像转化为代表圆柱基本属性的数值,如高度、宽度、朝向和位置。通过这种方式,数百乃至数千像素包含的信息被压缩为少量数值——即潜在表征。随后,另一种称为解码器的神经网络会学习将这些核心特征还原为圆柱图像。

V-JEPA专注于创建和复现潜在表征。其架构大体分为三部分:编码器1、编码器2和预测器。训练算法首先选取一组视频帧,遮蔽所有帧中相同的像素区域,将处理后的帧输入编码器1(有时视频最后几帧会被完全遮蔽)。编码器1将被遮蔽帧转化为潜在表征。同时,算法将未经遮蔽的完整帧输入编码器2,生成另一组潜在表征。

接着预测器开始工作:它根据编码器1生成的潜在表征,预测编码器2的输出结果。本质上,这是通过遮蔽帧的潜在表征来预测未遮蔽帧的潜在表征。通过重建相关潜在表征(而非早期系统那样还原缺失像素),模型学会关注道路上的车辆,而非纠结于树叶的晃动。

"这使模型能摒弃无关信息,聚焦视频更重要的方面,"Meta研究科学家昆汀·加里多解释道,"舍弃不必要信息至关重要,正是V-JEPA致力高效实现的目标。"

完成预训练阶段后,下一步是针对特定任务定制V-JEPA,如图像分类或视频动作识别。这种适配阶段需要少量人工标注数据,例如为视频添加动作标签。相较于为特定下游任务端到端训练整个系统,最终任务的适配所需标注数据量大幅减少。此外,同一套编码器和预测器网络可适配不同任务。

直觉模拟

今年2月,V-JEPA团队报告了其系统在理解现实世界直觉物理属性——如物体恒存性、形状颜色恒常性、重力与碰撞效应——方面的表现。在名为IntPhys的测试中(要求AI模型判断视频中的动作是否符合物理规律),V-JEPA准确率接近98%。而著名的像素空间预测模型准确率仅略高于随机猜测。

团队还量化了模型预测与观察不符时表现的"惊讶"程度:他们将预训练过的V-JEPA模型输入新视频,通过数学计算比较模型对后续帧的预测与实际画面的差异。结果发现,当后续帧出现物理上不可能的事件时,预测误差急剧上升。例如,若球体滚入遮蔽物后暂时消失,而后续帧中球体未重新出现,模型就会产生误差——这种反应类似于婴儿的直觉响应。可以说,V-JEPA"感到惊讶"了。

海尔布隆对V-JEPA的能力印象深刻:"发展心理学研究表明,婴儿无需大量接触就能学习这类直觉物理知识。他们证明这种能力本身是可学习的,且不需要预设大量先天认知前提,这很有说服力。"

伦敦大学学院计算神经科学家卡尔·弗里斯顿认为,V-JEPA在模拟"大脑学习和建模世界的方式"方面方向正确,但仍缺乏某些基本要素:"当前方案缺失的是对不确定性的恰当编码。"例如,若过往帧信息不足以准确预测未来帧,预测就会存在不确定性,而V-JEPA未能量化这种不确定性。

6月,Meta的V-JEPA团队发布了新一代拥有12亿参数的V-JEPA 2模型,该模型基于2200万段视频进行预训练。团队还将模型应用于机器人领域:他们演示了如何仅用约60小时机器人数据(包括机器人视频及其动作信息)进一步微调新的预测器网络,随后用微调后的模型规划机器人下一步动作。"此类模型能完成简单的机器人操控任务,为未来研究铺平道路,"加里多表示。

为推进V-JEPA 2发展,团队设计了更困难的直觉物理理解基准测试IntPhys 2。面对这些更严峻的测试,V-JEPA 2及其他模型的表现仅略优于随机猜测。加里多指出,部分原因在于V-JEPA 2仅能处理数秒视频输入并预测未来数秒,更长的内容会被遗忘。虽然可再次类比婴儿认知,但加里多联想到另一种生物:"某种意义上,模型的记忆让人联想到金鱼。"

本文经《量子杂志》授权转载。《量子杂志》是西蒙斯基金会旗下独立编辑的出版物,其宗旨是通过报道数学、物理与生命科学的研究进展和趋势,提升公众对科学的理解。

英文来源:

The original version of this story appeared in Quanta Magazine.
Here’s a test for infants: Show them a glass of water on a desk. Hide it behind a wooden board. Now move the board toward the glass. If the board keeps going past the glass, as if it weren’t there, are they surprised? Many 6-month-olds are, and by a year, almost all children have an intuitive notion of an object’s permanence, learned through observation. Now some artificial intelligence models do too.
Researchers have developed an AI system that learns about the world via videos and demonstrates a notion of “surprise” when presented with information that goes against the knowledge it has gleaned.
The model, created by Meta and called Video Joint Embedding Predictive Architecture (V-JEPA), does not make any assumptions about the physics of the world contained in the videos. Nonetheless, it can begin to make sense of how the world works.
“Their claims are, a priori, very plausible, and the results are super interesting,” says Micha Heilbron, a cognitive scientist at the University of Amsterdam who studies how brains and artificial systems make sense of the world.
Higher Abstractions
As the engineers who build self-driving cars know, it can be hard to get an AI system to reliably make sense of what it sees. Most systems designed to “understand” videos in order to either classify their content (“a person playing tennis,” for example) or identify the contours of an object—say, a car up ahead—work in what’s called “pixel space.” The model essentially treats every pixel in a video as equal in importance.
But these pixel-space models come with limitations. Imagine trying to make sense of a suburban street. If the scene has cars, traffic lights and trees, the model might focus too much on irrelevant details such as the motion of the leaves. It might miss the color of the traffic light, or the positions of nearby cars. “When you go to images or video, you don’t want to work in [pixel] space because there are too many details you don’t want to model,” said Randall Balestriero, a computer scientist at Brown University.
The V-JEPA architecture, released in 2024, is designed to avoid these problems. While the specifics of the various artificial neural networks that comprise V-JEPA are complex, the basic concept is simple.
Ordinary pixel-space systems go through a training process that involves masking some pixels in the frames of a video and training neural networks to predict the values of those masked pixels. V-JEPA also masks portions of video frames. But it doesn’t predict what’s behind the masked regions at the level of individual pixels. Rather, it uses higher levels of abstractions, or “latent” representations, to model the content.
Latent representations capture only essential details about data. For example, given line drawings of various cylinders, a neural network called an encoder can learn to convert each image into numbers representing fundamental aspects of each cylinder, such as its height, width, orientation and location. By doing so, the information contained in hundreds or thousands of pixels is converted into a handful of numbers—the latent representations. A separate neural network called a decoder then learns to convert the cylinder’s essential details into an image of the cylinder.
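To make the encoder-decoder idea concrete, here is a minimal PyTorch sketch, assuming toy 64-by-64 line drawings and a four-number latent. The layer sizes are illustrative, and in a toy autoencoder like this the four latent values are not guaranteed to line up neatly with height, width, orientation and location; the point is simply that thousands of pixel values get squeezed through a handful of numbers and reconstructed from them.

```python
# Illustrative encoder/decoder pair, not Meta's code.
import torch
import torch.nn as nn

LATENT_DIM = 4  # intended to stand for height, width, orientation, location

encoder = nn.Sequential(
    nn.Flatten(),                        # 64x64 drawing -> 4096 pixel values
    nn.Linear(64 * 64, 256), nn.ReLU(),
    nn.Linear(256, LATENT_DIM),          # thousands of pixels -> 4 numbers
)
decoder = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, 64 * 64),
    nn.Unflatten(1, (64, 64)),           # 4 numbers -> reconstructed drawing
)

images = torch.rand(8, 64, 64)           # stand-in for a batch of line drawings
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

recon = decoder(encoder(images))
loss = nn.functional.mse_loss(recon, images)   # reconstruction objective
loss.backward()
opt.step()
```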
V-JEPA focuses on creating and reproducing latent representations. At a high level, the architecture is split into three parts: encoder 1, encoder 2, and a predictor. First, the training algorithm takes a set of video frames, masks the same set of pixels in all frames, and feeds the frames into encoder 1. Sometimes, the final few frames of the video are fully masked. Encoder 1 converts the masked frames into latent representations. The algorithm also feeds the unmasked frames in their entirety into encoder 2, which converts them into another set of latent representations.
Now the predictor gets into the act. It uses the latent representations produced by encoder 1 to predict the output of encoder 2. In essence, it takes latent representations generated from masked frames and predicts the latent representations generated from the unmasked frames. By re-creating the relevant latent representations, and not the missing pixels of earlier systems, the model learns to see the cars on the road and not fuss about the leaves on the trees.
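The training step described above can be sketched as follows. This is a hedged illustration rather than Meta's implementation: the real V-JEPA uses transformer-based video encoders, and the article does not say how encoder 2 itself is trained, so its outputs are simply treated as fixed targets here. The essential point is that the loss is computed between latent vectors, not between pixels.

```python
# Minimal sketch of latent-space prediction with two encoders and a predictor.
import torch
import torch.nn as nn

FRAME_PIXELS, LATENT_DIM = 32 * 32, 64   # toy frame size and latent width

def make_encoder() -> nn.Module:
    return nn.Sequential(nn.Flatten(), nn.Linear(FRAME_PIXELS, LATENT_DIM))

encoder_1 = make_encoder()               # sees the masked frames
encoder_2 = make_encoder()               # sees the full, unmasked frames
predictor = nn.Linear(LATENT_DIM, LATENT_DIM)

frames = torch.rand(8, 32, 32)           # a short clip of video frames
mask = (torch.rand(32, 32) > 0.5).float()
masked_frames = frames * mask            # same pixel mask applied to every frame

z_masked = encoder_1(masked_frames)      # latents of the masked frames
with torch.no_grad():                    # targets treated as fixed in this toy
    z_target = encoder_2(frames)         # latents of the unmasked frames

loss = nn.functional.l1_loss(predictor(z_masked), z_target)
loss.backward()                          # the error lives in latent space, not pixels
```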
“This enables the model to discard unnecessary … information and focus on more important aspects of the video,” said Quentin Garrido, a research scientist at Meta. “Discarding unnecessary information is very important and something that V-JEPA aims at doing efficiently.”
Once this pretraining stage is complete, the next step is to tailor V-JEPA to accomplish specific tasks such as classifying images or identifying actions depicted in videos. This adaptation phase requires some human-labeled data. For example, videos have to be tagged with information about the actions contained in them. The adaptation for the final tasks requires much less labeled data than if the whole system had been trained end to end for specific downstream tasks. In addition, the same encoder and predictor networks can be adapted for different tasks.
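A rough sketch of that adaptation phase, under the assumption that the pretrained encoder is reused as-is and only a small task head is trained on the labeled clips; the head, label set and tensor shapes below are placeholders, not details from the paper.

```python
# Illustrative adaptation step: frozen pretrained encoder plus a small trained head.
import torch
import torch.nn as nn

LATENT_DIM, NUM_ACTIONS = 64, 10         # placeholder sizes

pretrained_encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, LATENT_DIM))
for p in pretrained_encoder.parameters():
    p.requires_grad = False              # reuse the pretrained weights unchanged

action_head = nn.Linear(LATENT_DIM, NUM_ACTIONS)         # the only part trained here
opt = torch.optim.Adam(action_head.parameters(), lr=1e-3)

frames = torch.rand(8, 32, 32)                           # labeled clips (stand-in)
labels = torch.randint(0, NUM_ACTIONS, (8,))             # human-provided action tags

logits = action_head(pretrained_encoder(frames))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
opt.step()
```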
Intuition Mimic
In February, the V-JEPA team reported how their systems did at understanding the intuitive physical properties of the real world—properties such as object permanence, the constancy of shape and color, and the effects of gravity and collisions. On a test called IntPhys, which requires AI models to identify if the actions happening in a video are physically plausible or implausible, V-JEPA was nearly 98 percent accurate. A well-known model that predicts in pixel space was only a little better than chance.
The V-JEPA team also explicitly quantified the “surprise” exhibited by their model when its prediction did not match observations. They took a V-JEPA model pretrained on natural videos, fed it new videos, then mathematically calculated the difference between what V-JEPA expected to see in future frames of the video and what actually happened. The team found that the prediction error shot up when the future frames contained physically impossible events. For example, if a ball rolled behind some occluding object and temporarily disappeared from view, the model generated an error when the ball didn’t reappear from behind the object in future frames. The reaction was akin to the intuitive response seen in infants. V-JEPA, one could say, was surprised.
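That surprise measurement can be pictured as a per-frame comparison between the latents the model predicted and the latents it actually computed from the incoming frames. The distance function and threshold below are illustrative stand-ins, not the exact quantities the team reported.

```python
# Toy version of the surprise signal: large prediction error = surprise.
import torch

def surprise_scores(predicted_latents: torch.Tensor,
                    observed_latents: torch.Tensor) -> torch.Tensor:
    """Per-frame prediction error between expected and observed latents."""
    return (predicted_latents - observed_latents).abs().mean(dim=-1)

# Toy data: 10 future frames with 64-dimensional latents. Frame 7 plays the role
# of the impossible event (the ball never re-emerges from behind the occluder).
predicted = torch.zeros(10, 64)
observed = torch.zeros(10, 64)
observed[7] += 3.0                       # reality diverges sharply from the prediction

THRESHOLD = 1.0                          # assumed cutoff; would be tuned in practice
for t, score in enumerate(surprise_scores(predicted, observed)):
    if score > THRESHOLD:
        print(f"frame {t}: surprising (error {score:.2f})")   # flags frame 7
```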
Heilbron is impressed by V-JEPA’s ability. “We know from developmental literature that babies don’t need a lot of exposure to learn these types of intuitive physics,” he said. “It’s compelling that they show that it’s learnable in the first place, and you don’t have to come with all these innate priors.”
Karl Friston, a computational neuroscientist at University College London, thinks that V-JEPA is on the right track in terms of mimicking the “way our brains learn and model the world.” However, it still lacks some fundamental elements. “What is missing from [the] current proposal is a proper encoding of uncertainty,” he said. For example, if the information in the past frames isn’t enough to accurately predict the future frames, the prediction is uncertain, and V-JEPA doesn’t quantify this uncertainty.
In June, the V-JEPA team at Meta released their next-generation 1.2-billion-parameter model, V-JEPA 2, which was pretrained on 22 million videos. They also applied the model to robotics: They showed how to further fine-tune a new predictor network using only about 60 hours of robot data (including videos of the robot and information about its actions), then used the fine-tuned model to plan the robot’s next action. “Such a model can be used to solve simple robotic manipulation tasks and paves the way to future work in this direction,” Garrido said.
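The article does not describe the planner itself, so the following is only a generic sketch of how an action-conditioned predictor can be used to pick a robot's next move: sample candidate actions, imagine the latent state each one would lead to, and choose the action whose prediction lands closest to a goal latent. The predictor shape, the 7-dimensional action and the goal encoding are all assumptions for illustration.

```python
# Generic model-based action selection with a learned latent predictor.
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM = 64, 7           # e.g. a 7-DoF arm command (assumption)

# Action-conditioned predictor: (current latent, action) -> predicted next latent.
predictor = nn.Linear(LATENT_DIM + ACTION_DIM, LATENT_DIM)

current_latent = torch.rand(LATENT_DIM)  # encoding of the current camera view
goal_latent = torch.rand(LATENT_DIM)     # encoding of an image of the desired goal

candidate_actions = torch.rand(32, ACTION_DIM)            # sampled action proposals
inputs = torch.cat([current_latent.expand(32, -1), candidate_actions], dim=-1)

with torch.no_grad():
    predicted_next = predictor(inputs)                    # imagined next states
distances = (predicted_next - goal_latent).abs().mean(dim=-1)
best_action = candidate_actions[distances.argmin()]       # action closest to the goal
print("chosen action:", best_action)
```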
To push V-JEPA 2, the team designed a more difficult benchmark for intuitive physics understanding, called IntPhys 2. V-JEPA 2 and other models did only slightly better than chance on these tougher tests. One reason, Garrido said, is that V-JEPA 2 can handle only about a few seconds of video as input and predict a few seconds into the future. Anything longer is forgotten. You could make the comparison again to infants, but Garrido had a different creature in mind. “In a sense, the model’s memory is reminiscent of a goldfish,” he said.
Original story reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.
