An illustration provided by Google.
On Thursday, Google DeepMind announced that its AI systems AlphaProof and AlphaGeometry 2 solved four of the six problems in this year’s International Mathematical Olympiad (IMO), earning a score equivalent to a silver medal. The tech giant claims this is the first time an AI has achieved this level of performance in the prestigious math competition. But as usual when it comes to AI, the claims aren’t as clear-cut as they seem.
Google says AlphaProof uses reinforcement learning to prove mathematical statements in a formal language called Lean. The system trains by generating and verifying millions of proofs, gradually tackling more difficult problems. Meanwhile, AlphaGeometry 2 is described as an improved version of Google’s previous geometry-solving AI model, now powered by a Gemini-based language model trained on much more data.
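For readers unfamiliar with formal proof assistants, the toy example below gives a rough sense of what "proving a mathematical statement in Lean" means: a statement written in precise machine-readable form, followed by a proof the software checks step by step. This is only an illustrative sketch using a trivial fact (with a made-up theorem name); it is not one of the IMO problems and not output from AlphaProof.

```lean
-- Toy illustration of a machine-checked Lean 4 proof (core library only).
-- Claim: if a = 2*k and b = 2*l, then a + b = 2*(k + l).
theorem even_add_even (a b k l : Nat) (ha : a = 2 * k) (hb : b = 2 * l) :
    a + b = 2 * (k + l) := by
  -- Substitute the hypotheses, then distribute: 2*(k + l) = 2*k + 2*l.
  rw [ha, hb, Nat.mul_add]
```

AlphaProof's actual proofs of olympiad problems are vastly longer and harder to find; the point of the example is only that every step is verifiable by the Lean checker, which is what makes the reinforcement-learning loop of generating and verifying millions of proofs possible.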
According to Google, mathematicians Sir Timothy Gowers and Dr. Joseph Myers evaluated the AI model’s solutions using the official IMO rules. The company says its combined system scored 28 points out of a possible 42, just shy of the 29-point threshold for the gold medal. That included a perfect score on the competition’s hardest problem, which Google says only five human competitors solved this year.
A Mathematics Competition Like No Other
The IMO, held annually since 1959, challenges elite pre-college mathematicians with exceptionally difficult problems in algebra, combinatorics, geometry, and number theory. Performance on IMO problems has become a recognized benchmark for assessing the mathematical reasoning capabilities of AI systems.
Google says AlphaProof solved two algebra problems and one number theory problem, while AlphaGeometry 2 tackled the geometry question. Neither system reportedly managed to solve the two combinatorics problems. The company says its systems solved one problem within minutes, while others took up to three days.
Google says it first translated the IMO problems into formal mathematical language so its AI model could process them. This step differs from the official competition, where human competitors work directly on the problem statements over two 4.5-hour sessions.
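To make that translation step concrete, the sketch below shows roughly what a formalized problem looks like before any solving happens: a precise Lean theorem statement whose proof is left open (marked `sorry`) for the prover to fill in. The statement and its name are hypothetical stand-ins, not one of this year's IMO problems.

```lean
-- Hypothetical stand-in for a formalized problem statement (not a real IMO problem):
-- "every natural number n is at most some perfect square."
-- The proof is deliberately left as `sorry`; filling in this open goal with a
-- machine-checkable argument is the task a prover would be given.
theorem sample_formalized_statement (n : Nat) :
    ∃ m : Nat, n ≤ m ^ 2 := by
  sorry
```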
Google says that before this year’s competition, AlphaGeometry 2 was able to solve 83% of the IMO’s historical geometry problems from the past 25 years, compared to 53% success for its predecessor. The company says the new system solved this year’s geometry problem in 19 seconds after being given the formalized version.
Boundaries
Despite Google’s claims, Sir Timothy Gowers offered a more nuanced perspective on Google DeepMind’s models in a thread posted on X. While acknowledging that this achievement is “well beyond what automatic theorem provers could do before,” Gowers highlighted several key caveats.
“The main negative is that the program required much more time than its human competitors – over 60 hours for some problems – and of course a much faster processing speed than the poor human brain,” Gowers wrote. “If the human competitors had been allowed that kind of time per problem, they would undoubtedly have performed better.”
Gowers also noted that humans manually translated the problems into the formal Lean language before the AI models began their work. He pointed out that while the AI handled the core mathematical reasoning, this formalization step was performed by people rather than by the system itself.
As for the broader implications for mathematical research, Gowers expressed uncertainty. “Are we close to the point where mathematicians will be redundant? It’s hard to say. I suspect we’re still a breakthrough or two away from that,” he wrote. He suggested that the system’s long processing times indicate that it hasn’t “solved the math” but acknowledged that “there’s clearly something interesting going on when it’s running.”
Despite these limitations, Gowers speculated that such AI systems could become valuable research tools. “So we might be on the verge of developing a program that would allow mathematicians to get answers to a wide range of questions, as long as those questions aren’t too difficult—the sort of thing you can do in a few hours. It would be an extremely useful research tool, even if it weren’t itself capable of solving open problems.”