Step Aside Mathematicians, AlphaProof is Coming

July 28, 2024

At the headquarters of Google DeepMind, the artificial intelligence lab in London, researchers have a long-standing ritual for announcing major results: They ring a large ceremonial gong.

In 2016, the gong rang for AlphaGo, an artificial intelligence system that excelled at the game of Go. In 2017, the gong rang out when AlphaZero conquered chess. Each time, the algorithm had beaten human world champions.

Last week, DeepMind researchers rang the gong again to celebrate what Alex Davies, the leader of Google DeepMind’s mathematics initiative, described as a “tremendous breakthrough” in mathematical reasoning by an artificial intelligence system. A pair of Google DeepMind models tried their hand at the problem set of the 2024 International Mathematical Olympiad (IMO), held at the University of Bath, about 100 miles west of London, from July 11 to July 22. The event is billed as the premier mathematics competition for the world’s “brightest mathematicians,” according to a promotional post on social media.


Human problem solvers — 609 high school students from 108 countries — won 58 gold medals, 123 silver, and 145 bronze. The AI performed at the silver-medal level, solving four of the six problems for a total score of 28. It was the first time an AI had achieved medal-worthy performance on Olympiad problems.

“It’s not perfect, we don’t have everything figured out,” Pushmeet Kohli, vice president of research at Google DeepMind, said in an interview. “We want to be perfect.”

Kohli, however, described the outcome as a “phase transition”, that is, “a transformative shift in the use of AI in mathematics and the ability of AI systems to do mathematics.”

To assess the AI’s performance, the lab enlisted the help of two independent experts: Timothy Gowers, a mathematician at the University of Cambridge in England and a Fields medalist who has been interested in the interaction of mathematics and AI for 25 years, and Joseph Myers, a software developer at Cambridge. Both have won IMO gold medals in their time. Myers was chair of this year’s problem selection committee and served as a coordinator for evaluating human solutions at previous Olympiads. “I tried to evaluate AI efforts this year in a way that was consistent with how human efforts were evaluated,” he said.

“I was absolutely impressed,” Gowers added in an email. The lab had discussed its Olympiad ambitions with him a few weeks in advance, so “my expectations were pretty high,” he said. “But the program met them, and in one or two cases significantly exceeded them.” He said the program found the “magic keys” that solved the problems.

Hit the gong

After months of rigorous training, the students sat two exams, three problems per day, covering algebra, combinatorics, geometry, and number theory.

Their AI counterpart ran at roughly the same time in the London lab. (The students were unaware that Google DeepMind was competing, in part because the researchers didn’t want to be the center of attention.) The researchers carried the gong into the room where they gathered to watch the system work. “Every time the system solved a problem, we rang the gong to celebrate,” said research scientist David Silver.

Haojia Shi, a student from China, ranked first individually and was the only competitor to earn a perfect score — 42 points for six problems; each problem is worth seven points for a complete solution. In the team rankings, the United States came in first with 192 points; China came in second with 190.

The Google system earned its 28 points by completely solving four problems—two in algebra, one in geometry, and one in number theory. (It failed to solve the two combinatorics problems.) The system was allowed unlimited time; some problems took it up to three days. Students were allotted just 4.5 hours per exam.

Silver said that for the Google DeepMind team, overall success mattered more than speed, because “it’s really just a matter of how much processing power you’re willing to put into these things.”

“The fact that we’ve reached this threshold, that we’ve come to a point where it’s possible to tackle these problems, represents a step change in the history of mathematics,” he added. “And I hope it represents not just a step change in the IMO, but also the point at which we’ve moved from computers being able to prove only very, very simple things to computers being able to prove things that humans can’t.”

Algorithmic components

Applying AI to mathematics has been part of DeepMind’s mission, often in collaboration with research mathematicians around the world.

“Mathematics requires this interesting combination of abstract, precise, and creative reasoning,” Davies said. In part, he noted, this repertoire of abilities is what makes math a good litmus test for the ultimate goal: achieving so-called artificial general intelligence, or AGI, a system with abilities that range from the novice level through gifted and virtuosic to superhuman. Companies like OpenAI, Meta AI, and xAI are pursuing similar goals.

Olympiad mathematics questions are now accepted as a benchmark.

In January, a Google DeepMind system called AlphaGeometry solved a sample of Olympiad geometry problems at nearly the level of a human gold medalist. “AlphaGeometry 2 now surpasses gold medalists in solving IMO problems,” lead researcher Thang Luong said in an email.

Building on that momentum, Google DeepMind intensified its interdisciplinary Olympiad effort with two teams, one led by Thomas Hubert, a research engineer in London, and one led by Luong and Quoc Le in Mountain View, each comprising about 20 researchers. Luong said he recruited a dozen IMO medalists for his “superhuman reasoning team” — “the highest concentration of IMO medalists at Google!”

The lab’s run at this year’s Olympiad used a refined version of AlphaGeometry. Unsurprisingly, the model performed quite well on the geometry problem, solving it in 19 seconds.

Hubert’s team developed a new model that is comparable but more general. This model, called AlphaProof, is designed to engage with a broad range of mathematical topics. Together, AlphaGeometry and AlphaProof draw on a number of different AI technologies.

One approach was an informal reasoning system, expressed in natural language. This system leveraged Gemini, Google’s large language model, and used an English-language corpus of published problems, proofs, and the like as training data.

The informal system excels at spotting patterns and suggesting what comes next; it is creative and talks about ideas fluently. Of course, large language models tend to make things up — this may (or may not) be acceptable for poetry, but it certainly doesn’t work for mathematics. In this context, though, the LLM seems to have shown restraint; it wasn’t immune to hallucination, but it hallucinated less often.

Another approach was a formal reasoning system, based on logic and expressed in code. It used a theorem prover and proof assistant called Lean, which guarantees that if the system says a proof is true, then it really is true. “We can fully check that the proof is true,” Hubert said. “Every step is guaranteed to be logically sound.”
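To make that guarantee concrete, here is a minimal Lean 4 example (a toy theorem of my own choosing, not drawn from DeepMind’s systems): the statement is written in Lean’s formal language, and the proof is accepted only if Lean’s kernel can verify every step.

```lean
-- A toy illustration, not DeepMind's code: a formally stated theorem in Lean 4.
-- Lean accepts the proof only if its kernel verifies every step, which is the
-- guarantee Hubert describes.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```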

Another key component was a reinforcement learning algorithm descended from AlphaGo and AlphaZero. Silver, Google DeepMind’s vice president of reinforcement learning, said this type of AI learns on its own and can, in principle, scale indefinitely. Because the algorithm doesn’t need a human teacher, it can “learn and keep learning and keep learning until eventually it can solve the hardest problems that humans can solve,” he said. “And then maybe one day even go beyond that.”

Hubert added, “The system can rediscover knowledge on its own.” That’s what happened with AlphaZero: It started with zero knowledge, Hubert said, “and just by playing games and seeing who won and who lost, it could rediscover all the knowledge about chess. It took less than a day to rediscover all the knowledge about chess, and about a week to rediscover all the knowledge about Go. So we thought, let’s apply this to mathematics.”
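To give a flavor of what learning from self-play without a human teacher can look like, here is a minimal, hypothetical Python sketch: a tabular value table trained on tic-tac-toe from game outcomes alone. It is far simpler than whatever AlphaProof actually does, but it shows the principle Hubert describes: the agent plays against itself, and the only training signal is who won and who lost.

```python
# Toy self-play learning on tic-tac-toe (illustrative only, not DeepMind's method).
# The agent improves purely from final game results; no human teacher is involved.
import random
from collections import defaultdict

EMPTY, X, O = 0, 1, -1
LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    # Return X or O if a line is completed, otherwise None.
    for a, b, c in LINES:
        if board[a] != EMPTY and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, v in enumerate(board) if v == EMPTY]

Q = defaultdict(float)      # (board, move) -> estimated value for the player moving
ALPHA, EPSILON = 0.3, 0.1   # learning rate and exploration rate

def choose(board):
    # Mostly pick the best-known move, occasionally explore a random one.
    moves = legal_moves(board)
    if random.random() < EPSILON:
        return random.choice(moves)
    return max(moves, key=lambda m: Q[(board, m)])

def self_play_episode():
    board, player, history = (EMPTY,) * 9, X, []
    while True:
        move = choose(board)
        history.append((board, move, player))
        board = board[:move] + (player,) + board[move + 1:]
        win = winner(board)
        if win is not None or not legal_moves(board):
            # Only signal: +1 for the winner's moves, -1 for the loser's, 0 for a draw.
            for state, mv, who in history:
                reward = 0.0 if win is None else (1.0 if who == win else -1.0)
                Q[(state, mv)] += ALPHA * (reward - Q[(state, mv)])
            return
        player = -player

for _ in range(50_000):
    self_play_episode()
print("Learned value of taking the centre first:", Q[((EMPTY,) * 9, 4)])
```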

Gowers isn’t too worried about the long-term consequences. “It’s possible to imagine a situation where mathematicians would basically have nothing left to do,” he said. “That would be the case if computers got better and faster at everything mathematicians do now.”

“It seems like there’s still a long way to go before computers can do research-level math,” he added. “If Google DeepMind can solve at least some of the hard IMO problems, then it’s a pretty safe bet that a useful research tool isn’t too far away.”

A truly capable tool could make mathematics accessible to more people, speed up research, push mathematicians beyond conventional ways of thinking, and ultimately lead to resonant new ideas.

c.2024 The New York Times Company
