To Have Machines Make Math Proofs, Turn Them Into a Puzzle

Published by qimuai | First-hand translation



Source: https://www.quantamagazine.org/to-have-machines-make-math-proofs-turn-them-into-a-puzzle-20251110/

Summary:

Marijn Heule, a computer scientist at Carnegie Mellon University, has recently proposed a striking research path: turn hard mathematical proofs into sudoku-like logic puzzles, so that artificial intelligence systems can crack problems humans cannot solve on their own.

Using a classic AI technique called satisfiability (SAT), Heule and his collaborators have already settled several problems in geometry and combinatorics that had stood open for close to a century, including Schur number 5 and Keller's conjecture in dimension seven. SAT works by grinding a problem down into chains of true/false statements, producing automated proofs that are too long for any human to verify in full, yet logically airtight.

While large language models have won medals at the International Mathematical Olympiad, the article notes, they have so far only solved problems that humans can also solve. The distinctive value of SAT, Heule argues, is that it "has already solved several problems for which there is no human proof." He is now pushing to combine SAT with large language models: the LLM supplies the high-level mathematical ideas, SAT handles the logical verification, and a proof checker such as Lean certifies that the whole pipeline is trustworthy.

To the objection that machine proofs cannot be understood, Heule responds: "Mathematicians overvalue understanding and undervalue trust. Once automated reasoning can supply a verified proof, it is actually more trustworthy than many pen-and-paper proofs." He stresses that the creative intuition of human mathematicians remains irreplaceable, and that the ideal future is a three-way collaboration among mathematicians, generative AI, and automated reasoning.

Beyond potentially reshaping how mathematical research is done, the work points to a new mode of human-machine collaboration: machines take on the tedious logical verification while humans focus on conceptual innovation, together pushing the frontier of mathematics forward.


Original article:

To Have Machines Make Math Proofs, Turn Them Into a Puzzle
Introduction
The mathematical conundrums that Marijn Heule has helped crack in the last decade sound like code names lifted from a sci-fi spy novel: the empty hexagon. Schur Number 5. Keller’s conjecture, dimension seven. In reality, they are (or, more accurately, were) some of the most stubborn problems in geometry and combinatorics, defying solution for 90 years or more. Heule used a computational Swiss Army knife called satisfiability, or SAT, to whittle them into submission. Now, as a member of Carnegie Mellon University’s Institute for Computer-Aided Reasoning in Mathematics, he believes that SAT can be joined with large language models (LLMs) to create tools powerful enough to tame even harder problems in pure math.
“LLMs have won medals in the International Mathematical Olympiad, but these are all problems that humans can also solve,” Heule said. “I really want to see AI solve the first problem that humans cannot. And the cool thing about SAT is that it already has been shown that it was able to solve several problems for which there is no human proof.”
SAT is actually one of the foundations of artificial intelligence itself, although not the kind of AI that makes headlines by mimicking fluent conversation or spooking researchers with supposedly “evil” thoughts. Instead, SAT belongs to the tradition of symbolic artificial intelligence (also known as GOFAI, or “good old-fashioned AI”), which uses hard-coded rules — not the inscrutable interactions within a deep neural network — to produce results. In fact, SAT is about as simple as AI gets, conceptually speaking: It relies on statements that can have only two possible values, true or false, linked together in ironclad chains of logic. If problems can be ground down into these logical “atoms,” computer programs called SAT solvers can often build airtight proofs about them — a process called, appropriately, “automated reasoning.” Those proofs might be long, sometimes too long for humans to ever parse ourselves. But they are sound.
Heule admits his expertise isn’t necessarily in the actual math underlying such proofs, but rather in the puzzlelike thinking required to translate problems into a format that SAT solvers can attack. And his talent came early. “According to my parents, I was able to solve my first puzzle of 100 pieces when I was 1 year old — so, before I could walk,” Heule said. Heule took his first class in satisfiability as an undergraduate at the Delft University of Technology, built his first SAT solver soon after, and earned a doctorate in computer science under the supervision of Hans van Maaren, “one of the founding fathers of the field” of satisfiability research, Heule said. The two men went on to co-author a definitive 1,500-page textbook on SAT.
Heule says he’s long been intrigued by whether computers can solve problems beyond human reasoning. “That’s the question I’m still asking: How do you automate reasoning? Can it be done the way humans do, or does it need to be something completely different? My conclusion, so far, is the latter. All my successes are based on that insight.”
Quanta spoke to Heule about the difference between machine and human reasoning, how SAT’s simplicity is its secret weapon, and why understanding is overrated in mathematics. The interview has been condensed and edited for clarity.
First things first: What is SAT?
It uses something called a propositional formula, which you can imagine as a very big sudoku board. In every cell, you have only two options: one or zero, standing for true or false. You also have the rules, or constraints, about how many zeros or ones can be in each row or column. Can you put in all the zeros and ones such that all those constraints are satisfied?
Despite its simplicity, this formulation is remarkably powerful. A wide variety of important problems, including hardware and software verification, scheduling, and even areas of pure mathematics, can be translated into SAT.
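To make the sudoku analogy concrete, here is a minimal, self-contained Python sketch: a tiny formula in conjunctive normal form (the clause set is invented for illustration) and a brute-force search for an assignment of trues and falses that satisfies every constraint.

```python
from itertools import product

# A formula in conjunctive normal form (CNF): a list of clauses.
# Each clause is a list of literals: positive n means "variable n is true",
# negative n means "variable n is false". This toy formula encodes
# (x1 or x2) and (not x1 or x3) and (not x2 or not x3).
clauses = [[1, 2], [-1, 3], [-2, -3]]
num_vars = 3

def satisfies(assignment, clauses):
    # A clause holds if at least one of its literals holds;
    # the formula holds if every clause holds.
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

# Brute force over all 2^n assignments: fine for a toy formula,
# hopeless at scale. Real SAT solvers exist to close that gap.
for values in product([False, True], repeat=num_vars):
    assignment = {i + 1: v for i, v in enumerate(values)}
    if satisfies(assignment, clauses):
        print("SATISFIABLE:", assignment)
        break
else:
    print("UNSATISFIABLE")
```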
That just sounds like binary computation. How is SAT-solving different from anything else a digital computer does?
What SAT tools do is fundamentally different from ordinary computation. A standard program takes input and carries out a sequence of operations to produce output. A SAT tool is not computing with the zeros and ones. Instead, it is searching for a combination of them that satisfies all the constraints.
That makes it more like solving a puzzle: You explore possibilities, rule out large portions of the space using clever reasoning, and keep going until you either find a satisfying assignment or conclude that none exists.
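The pruning Heule alludes to can be sketched as unit propagation inside a backtracking search, the skeleton of the classic DPLL procedure that modern solvers refine. A minimal sketch (an illustration, not a production solver):

```python
def dpll(clauses, assignment):
    """Backtracking search with unit propagation (the heart of DPLL).
    clauses: list of lists of integer literals; assignment: dict var -> bool."""
    # Simplify under the current assignment: drop satisfied clauses,
    # strip falsified literals out of the rest.
    simplified = []
    for clause in clauses:
        remaining, satisfied = [], False
        for lit in clause:
            var = abs(lit)
            if var in assignment:
                if assignment[var] == (lit > 0):
                    satisfied = True
                    break
            else:
                remaining.append(lit)
        if satisfied:
            continue
        if not remaining:      # every literal falsified: dead end, prune
            return None
        simplified.append(remaining)
    if not simplified:         # all clauses satisfied
        return assignment
    # Unit propagation: a one-literal clause forces a value, no branching.
    for clause in simplified:
        if len(clause) == 1:
            lit = clause[0]
            return dpll(simplified, {**assignment, abs(lit): lit > 0})
    # Otherwise branch on some unassigned variable and backtrack on failure.
    var = abs(simplified[0][0])
    for value in (True, False):
        result = dpll(simplified, {**assignment, var: value})
        if result is not None:
            return result
    return None

print(dpll([[1, 2], [-1, 3], [-2, -3]], {}))
```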
What can generative AI add to the power of SAT solvers?
Once you’ve figured out the right representation of a problem, called the encoding, SAT tools are extremely powerful. One of my skills is that I’m really good at coming up with the right representation. I’ve really studied how these tools reason; I know what is required so that you can get everything out of them. But ideally, you wouldn’t need this knowledge. I would really like to take myself out of the equation — then the technology could really shine.
So how can we make these effective translations also automatic? If you feed an LLM with lots of good examples of how you should do this, it will come up with something much better than the average user would do. Of course, the challenge here is how to be sure that the translation is correct.
Mathematicians such as Terence Tao think that generative AI can help the research process itself. Where does SAT fit into that picture?
LLMs can generate a lot of plausible-sounding lemmas [statements that are used to prove larger theorems], and automated reasoning can check whether they’re correct or not. But as soon as something is incorrect, the SAT solver can give counterexamples back — ideally, the smallest counterexample. Because the solver is really good at figuring out: OK, that mistake I just made, what did it depend on?
This matters because, in mathematics, counterexamples are incredibly valuable for developing intuition. If you ask a SAT solver, “Does a counterexample exist?” you don’t want it to return an enormous, incomprehensible object. A small counterexample often reveals immediately why the statement fails. Feeding such counterexamples back to an LLM can guide it in the same way a human would: by helping it to refine its suggestions and to propose better lemmas next.
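As a toy illustration of that loop (the "lemma" here is invented): to look for a counterexample to "the hypotheses imply the conclusion," ask whether the hypotheses together with the negated conclusion are satisfiable, and prefer assignments with the fewest true variables so any counterexample stays small.

```python
from itertools import combinations

def satisfies(assignment, clauses):
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def smallest_counterexample(hypotheses, conclusion_lit, num_vars):
    """Search for an assignment satisfying the hypotheses but refuting the
    conclusion, trying assignments with the fewest true variables first --
    a toy version of asking a solver for a *small* counterexample."""
    refutation = hypotheses + [[-conclusion_lit]]  # hypotheses AND NOT conclusion
    variables = range(1, num_vars + 1)
    for k in range(num_vars + 1):                  # fewest trues first
        for true_vars in combinations(variables, k):
            assignment = {v: v in true_vars for v in variables}
            if satisfies(assignment, refutation):
                return assignment
    return None  # no counterexample: the candidate lemma holds here

# Invented toy lemma: "(x1 or x2) implies x1". The minimal counterexample
# found is x1 = False, x2 = True.
print(smallest_counterexample([[1, 2]], conclusion_lit=1, num_vars=2))
```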
Kind of like a “targeting computer” for the AI?
Yes. I think LLMs can focus on the big picture of a mathematical statement and suggest how to break it into smaller pieces. Once you have that decomposition, automated reasoning can go through the pieces one by one: It can prove a piece, or it can refute it by giving a counterexample. And just as important, it can also check that the pieces really cover everything, so that nothing slips through the cracks.
To make the whole story trustworthy, you want a system like Lean [a formal proof checker] that checks all the proofs and all the gluing: that the lemmas are correct and that they really add up to the full statement. So the LLM performs the high-level carving, automated reasoning certifies each piece, and Lean ensures the whole thing is watertight. When all of that lines up, the pieces really do add up.
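A minimal Lean 4 sketch of that gluing role (a toy statement, not Heule's actual pipeline): two independently certified pieces combine into the full claim, and Lean accepts the result only because every step checks.

```lean
-- Two small pieces, each certified on its own (Lean 4 core lemmas).
theorem piece1 (n : Nat) : n + 0 = n := Nat.add_zero n
theorem piece2 (n : Nat) : 0 + n = n := Nat.zero_add n

-- The glue: rewriting with both pieces reduces the goal to n = n,
-- which Lean closes by reflexivity. If either piece failed to check,
-- the whole theorem would be rejected.
theorem glued (n : Nat) : n + 0 = 0 + n := by
  rw [piece1, piece2]
```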
These automated proofs can end up extremely long — far too long for humans to even read, much less understand. Won’t that become even more of a problem if LLMs, which are already hard to understand themselves, enter the picture?
The initial criticism of my work was mostly about this lack of understanding. [The Fields Medal–winning mathematician] Timothy Gowers called my work on the Pythagorean triples problem “the most disgusting proof ever.” But how I feel about this is that understanding in mathematics is highly overrated.
The thing is, there’s no mathematician alive who understands all of mathematics. It’s more that there are reputable mathematicians who can say, for each little piece of the puzzle, “OK, I checked. This is correct.” And now others can build on top of that. They overvalue understanding and undervalue trust. I think we should really embrace trust as the key thing to further mathematics, and this is what automation can give you. LLMs can do all of their bullshitting, but as soon as automated reasoning is able to say, “OK, but this part is actually correct, and here’s a proof,” this is actually more trustworthy than most of the pen-and-paper proofs out there. As long as there’s a good trust story, then you can build on top of that and really help mathematics. In such a way, you can see it as a “co-author” because it’s something that follows along with you and tries to spot the gaps in your reasoning.
Suppose that the productive interaction you describe between LLMs and SAT were actually built. What would be left for human mathematicians to do?
When I’ve solved open problems with SAT, it has always been together with mathematicians. They had already spent years thinking hard about these questions. I took their insights and encoded them so that the solver could finish the job. If I’d tried to do it alone, starting from scratch, I would have gotten nowhere.
I expect the future will look similar. My role has been to translate mathematical understanding into a SAT representation, but LLMs could help many more mathematicians learn how to do this themselves, so they don’t need me as the middleman. That’s incredibly exciting. With mathematicians, generative AI and automated reasoning working together, we have a real shot at cracking long-standing open problems. But removing humans from the loop entirely would be a mistake. The creative intuition, the conceptual reframing, that’s still something people are uniquely good at. The magic comes from the collaboration.
