There was a little more information in that reddit thread. Of the three difficulty tiers, 25% are T1 (easiest) and 50% are T2. Of the five public problems that the author looked at, two were T1 and two were T2. Glazer on reddit described T1 as "IMO/undergraduate problems", but the article author says that they don't consider them to be undergraduate problems. So the LLM is already doing what the author says they would be surprised about.
Also Glazer seemed to regret calling T1 "IMO/undergraduate", and not only because of the disparity between IMO and typical undergraduate. He said that "We bump problems down a tier if we feel the difficulty comes too heavily from applying a major result, even in an advanced field, as a black box, since that makes a problem vulnerable to naive attacks from models"
Also, all of the problems shown to Tao were T3.
> So the LLM is already doing what the author says they would be surprised about.
That's if you unconditionally believe the result without any proofreading, confirmation, or reproducibility, and with barely any details (we are given only one slide).
I just spent a few days trying to figure out some linear algebra with the help of ChatGPT. It's very useful for finding conceptual information from literature (which for a not-professional-mathematician at least can be really hard to find and decipher). But in the actual math it constantly makes very silly errors. E.g. indexing a vector beyond its dimension, trying to do matrix decomposition for scalars and insisting on multiplying matrices with mismatching dimensions.
O1 is a lot better at spotting its errors than 4o but it too still makes a lot of really stupid mistakes. It seems to be quite far from producing results itself consistently without at least a somewhat clueful human doing hand-holding.
I wonder if these are tokenization issues? I really am curious about Meta's byte tokenization scheme...
Probably mostly not. The errors tend to be logical/conceptual. E.g. mixing up scalars and matrices is unlikely to be from tokenization. Especially if using spaces between the variables and operators, as AFAIK GPTs don't form tokens over spaces (although tokens may start or end with them).
The only thing I've consistently had issues with while using AI is graphs. If I ask it to plot some simple function, it produces a really weird image that has nothing to do with the graph I want. It will be a weird swirl of lines and words, and it never corrects itself no matter what I say to it.
Has anyone had any luck with this? It seems like the only thing that it just can't do.
You're doing it wrong. It can't produce proper graphs with its diffusion-style image generation.
Ask it to produce graphs with python and matplotlib. That will work.
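For example, a prompt like "plot sin(x)/x on [-10, 10] with matplotlib" should get you something along these lines (a generic sketch, not any particular model's output):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-10, 10, 1000)
    y = np.sinc(x / np.pi)        # np.sinc(t) = sin(pi*t)/(pi*t), so this is sin(x)/x
    plt.plot(x, y)
    plt.xlabel("x")
    plt.ylabel("sin(x)/x")
    plt.grid(True)
    plt.show()                    # or plt.savefig("graph.png")

Running that (locally or in its code interpreter) gives an actual plot instead of a hallucinated picture of one.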
Ask it to plot the graph with python plotting utilities. Not using its image generator. I think you need a ChatGPT subscription though for it to be able to run python code.
You seem to get 2(?) free Python program runs per week(?) as part of the o1 preview.
When you visit chatgpt on the free account it automatically gives you the best model and then disables it after some amount of work and says to come back later or upgrade.
Just install Python locally, and copy paste the code.
Shouldn’t ChatGPT be smart enough to know to do this automatically, based on context?
The agentic reasoning models should be able to fix this if they have the ability to run code instead of giving each task to itself. "I need to make a graph" "LLMs have difficulty graphing novel functions" "Call python instead" is a line of reasoning I would expect after seeing what O1 has come up with on other problems.
Giving AI the ability to execute code is the safety people's nightmare though; wonder if we'll hear anything from them, as this is surely coming.
[deleted]
Isn't Wolfram Alpha a better "ChatGPT of Math"?
Wolfram Alpha is better at actually doing math, but far worse at explaining what it’s doing, and why.
What’s worse about it?
It never tells you the wrong thing, at the very least.
When you give it a large math problem and the answer is "seven point one three five ...", and it shows a plot of the result vs. some randomly selected domain, well, there could be more I'd like to know.
You can unlock a full derivation of the solution, for cases where you say "Solve" or "Simplify", but what I (and I suspect GP) might want, is to know why a few of the key steps might work.
It's a fantastic tool that helped get me through my (engineering) grad work, but ultimately the breakthrough inequalities that helped me write some of my best stuff were out of a book I bought in desperation that basically cataloged linear algebra known inequalities and simplifications.
When I try that kind of thing with the best LLM I can use (as of a few months ago, albeit), the results can get incorrect pretty quickly.
Its understanding of problems was very bad last time I used it. Meaning it was difficult to communicate what you wanted it to do. Usually I try to write in the Mathematica language, but even that is not foolproof.
Hopefully they have incorporated more modern LLM since then, but it hasn’t been that long.
Wolfram Alpha's "smartness" is often Clippy level enraging. E.g. it makes assumptions of symbols based on their names (e.g. a is assumed to be a constant, derivatives are taken w.r.t. x). Even with Mathematica syntax it tends to make such assumptions and refuses to lift them even when explicitly directed. Quite often one has to change the variable symbols used to try to make Alpha to do what's meant.
I wish there was a way to tell Chatgpt where it has made a mistake, with a single mouse click.
Wolfram Alpha can solve equations well, but it is terrible at understanding natural language.
For example I asked Wolfram Alpha "How heavy a rocket has to be to launch 5 tons to LEO with a specific impulse of 400s", which is a straightforward application of the Tsiolkovsky rocket equation. Wolfram Alpha gave me some nonsense about particle physics (result: 95 MeV/c^2), GPT-4o did it right (result: 53.45 tons).
Wolfram alpha knows about the Tsiolkovsky rocket equation, it knows about LEO (low earth orbit), but I found no way to get a delta-v out of it, again, more nonsense. It tells me about Delta airlines, mentions satellites that it knows are not in LEO. The "natural language" part is a joke. It is more like an advanced calculator, and for that, it is great.
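For reference, a rough sketch of the calculation GPT-4o presumably did, assuming a total LEO delta-v of about 9.3 km/s including losses (that figure is my assumption; the thread doesn't say what value was used):

    import math

    g0 = 9.80665        # standard gravity, m/s^2
    isp = 400.0         # specific impulse, s
    payload_t = 5.0     # tonnes delivered to LEO, treated as the final mass
    delta_v = 9300.0    # m/s, assumed total delta-v to LEO including losses

    ve = isp * g0                         # effective exhaust velocity
    mass_ratio = math.exp(delta_v / ve)   # Tsiolkovsky: m0/mf = exp(delta_v / ve)
    print(payload_t * mass_ratio)         # ~53.5 t, close to the 53.45 t figure above

This treats the whole final mass as payload (no dry mass for the rocket itself), which is the same simplification a one-line rocket-equation answer makes.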
You're using it wrong, you can use natural language in your equation, but afaik it's not supposed to be able to do what you're asking of it.
Wolfram Alpha is mostly for "trivia" type problems. Or giving solutions to equations.
I was figuring out some mode decomposition methods such as ESPRIT and Prony and how to potentially extend/customize them. Wolfram Alpha doesn't seem to have a clue about such.
No. Wolfram Alpha can't solve anything that isn't a function evaluation or equation. And it can't do modular arithmetic to save its unlife.
WolframOne/Mathematica is better, but that requires the user (or ChatGPT!) to write complicated code, not natural language queries.
Don't most mathematical papers contain at least one such error?
Where is this data from?
Yesterday, I saw a thought-provoking talk about the future of "math jobs", assuming automated theorem proving becomes more prevalent in the future.
[ (Re)imagining mathematics in a world of reasoning machines by Akshay Venkatesh]
https://www.youtube.com/watch?v=vYCT7cw0ycw [54min]
Abstract: In the coming decades, developments in automated reasoning will likely transform the way that research mathematics is conceptualized and carried out. I will discuss some ways we might think about this. The talk will not be about current or potential abilities of computers to do mathematics—rather I will look at topics such as the history of automation and mathematics, and related philosophical questions.
See discussion at https://news.ycombinator.com/item?id=42465907
As someone who has an 18-year-old son who wants to study math, this has me (and him) ... worried ... about becoming obsolete?
But I'm wondering what other people think of this analogy.
I used to be a bench scientist (molecular genetics).
There were world class researchers who were more creative than I was. I even had a Nobel Laureate once tell me that my research was simply "dotting 'i's and crossing 't's".
Nevertheless, I still moved the field forward in my own small ways. I still did respectable work.
So, will these LLMs make us completely obsolete? Or will there still be room for those of us who can dot the "i"?--if only for the fact that LLMs don't have infinite time/resources to solve "everything."
I don't know. Maybe I'm whistling past the graveyard.
What LLMs can do is limited: they are superior to wetware in some tasks, like finding and matching patterns in higher-dimensional space, but they are still fundamentally limited to a tiny class of problems outside of that pattern finding and matching.
LLMs will be tools for some math needs, and even if we ever get quantum computers, they will be limited in what they can do.
LLMs, without pattern matching, can only do up to about integer division, and while they can calculate parity, they can't use it in their calculations.
There are several groups sitting on what are known limitations of LLMs, waiting to take advantage of those who don't understand the fundamental limitations, simplicity bias etc...
The hype will meet reality soon and we will figure out where they work and where they are problematic over the next few years.
But even the most celebrated achievements like proof finding with Lean, heavily depends on smart people producing hints that machines can use.
Basically lots of the fundamental hints of the limits of computation still hold.
Modal logic may be an accessible way to approach the limits of statistical inference if you want to know one path yourself.
A lot of what is in this article relates to some of the known fundamental limitations.
Remember that for all the amazing progress, one of the core founders of the perceptron, Pitts, drank himself to death in the 50s because it was shown that they were insufficient to accurately model biological neurons.
Optimism is high, but reality will hit soon.
So think of it as new tools that will be available to your child, not a replacement.
"LLMs, without pattern matching, can only do up to about integer division, and while they can calculate parity, they can't use it in their calculations." - what do you mean by this? Counting the number of 1's in a bitstring and determining if it's even or odd?
I was just thinking about this. I already posted a comment here, but I will say, as a mathematician (PhD in number theory), that for me AI significantly takes away the beauty of doing mathematics within a realm in which AI is used.
The best part of math (again, just for me) is that it was a journey that was done by hand with only the human intellect that computers didn't understand. The beauty of the subject was precisely that it was a journey of human intellect.
As I said elsewhere, my friends used to ask me why something was true and it was fun to explain it to them, or ask them and have them explain it to me. Now most will just use some AI.
Soulless, in my opinion. Pure mathematics should be about the art of the thing, not producing results on an assembly line like it will be with AI. Of course, the best mathematicians are going into this because it helps their current careers, not because it helps the future of the subject. Math done with AI will be a lot like Olympic running done with performance-enhancing drugs.
Yes, we will get a few more results, faster. But the results will be entirely boring.
Presumably people who get into math going forward will feel differently.
For myself, chasing lemmas was always boring — and there’s little interest in doing the busywork of fleshing out a theory. For me, LLMs are a great way to do the fun parts (conceptual architecture) without the boring parts.
And I expect we'll see much the same change as with physics: computers increase the complexity of the objects we study, which tend to be rather simple when done by hand; e.g., people don't investigate patterns in the diagrams of group(oid)s because drawing million-element diagrams isn't tractable by hand. And you only notice the patterns in them when you see examples of the diagrams at scale.
Even current people will feel differently. I don't bemoan the fact that Lean/Mathlib has `simp` and `linarith` to automate trivial computations. A "copilot for Lean" that can turn "by induction, X" or "evidently Y" into a formal proof sounds great.
The trick is teaching the thing how high-powered a theorem to use, or how to factor out details or not, depending on the user's level of understanding. We'll have to find a pedagogical balance (e.g. you don't give `linarith` to someone practicing basic proofs), but I'm sure it will be a great tool to aid human understanding.
A tool to help translate natural language to formal propositions/types also sounds great, and could help more people to use more formal methods, which could make for more robust software.
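As a (hypothetical) taste of what that automation already looks like today, in Lean 4 with Mathlib the routine steps are a single tactic away:

    -- minimal sketch, assuming Lean 4 + Mathlib
    import Mathlib

    example (a b : ℕ) : a + b = b + a := by ring           -- routine algebra, discharged automatically
    example (x : ℝ) (h : 0 < x) : 0 < x + 1 := by linarith  -- linear arithmetic over an ordered field

The pedagogical question above is exactly about when to hand a learner tactics like these.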
I used to do bench top work too; and was blessed with “the golden hands” in that I could almost always get protocols working. To me this always felt more like intuition than deductive reasoning. And it made me a terrible TA. My advice to students in lab was always something along the lines of “just mess around with it, and see how it works.” Not very helpful for the stressed and struggling student -_-
Digression aside, my point is that I don’t think we know exactly what makes or defines “the golden hands”. And if that is the case, can we optimize for it?
Another point is that scalable fine tuning only works for verifiable stuff. Think a priori knowledge. To me that seems to be at the opposite end of the spectrum from “mess with it and see what happens”.
By the way, don't trust Nobel laureates or even winners. E.g. Linus Pauling was talking absolute garbage, harmful and evil, after winning the Nobel.
> don't trust Nobel laureates or even winners
Nobel laureate and winner are the same thing.
> Linus Pauling was talking absolute garbage, harmful and evil, after winning the Nobel.
Can you be more specific, what garbage? And which Nobel prize do you mean – Pauling got two, one for chemistry and one for peace.
Thank you, my bad.
I was referring to Linus's harmful and evil promotion of Vitamin C as a cure for everything, including cancer. I don't think Linus was attaching that garbage to any particular Nobel prize. But people did say to their doctors: "Are you a Nobel winner, doctor?" I don't think they cared about the particular prize either.
Eugenics and vitamin C as a cure all.
Eventually we may produce a collection of problems exhaustive enough that these tools can solve almost any problem that isn't novel in practice, but I doubt that they will ever become general problem solvers capable of what we consider to be reasoning in humans.
Historically, the claim that neural nets were actual models of the human brain and human thinking was always epistemically dubious. It still is. Even as the practical problems of producing better and better algorithms, architectures, and output have been solved, there is no reason to believe a connection between the mechanical model and what happens in organisms has been established. The most important point, in my view, is that all of the representation and interpretation still has to happen outside the computational units. Without human interpreters, none of the AI outputs have any meaning. Unless you believe in determinism and an overseeing god, the story for human beings is much different. AI will not be capable of reason until, like humans, it can develop socio-rational collectivities of meaning that are independent of the human being.
Researchers seemed to have a decent grasp on this in the 90s, but today, everyone seems all too ready to make the same ridiculous leaps as the original creators of neural nets. They did not show, as they claimed, that thinking is reducible to computation. All they showed was that a neural net can realize a boolean function—which is not even logic, since, again, the entire semantic interpretive side of the logic is ignored.
> Unless you believe in determinism and an overseeing god
Or perhaps, determinism and mechanistic materialism - which in STEM-adjacent circles has a relatively prevalent adherence.
Worldviews which strip a human being of agency in the sense you invoke crop up quite a lot today in such spaces. If you start off adopting a view like this, you have a deflationary sword which can cut down almost any notion that's not already mechanistic into an account in terms of mechanistic parts. "Meaning? Well that's just an emergent phenomenon of the influence of such and such causal factors in the unrolling of a deterministic physical system."
Similar for reasoning, etc.
Now obviously large swathes of people don't really subscribe to this - but it is prevalent and ties in well with utopian progress stories. If something is amenable to mechanistic dissection, possibly it's amenable to mechanistic control. And that's what our education is really good at teaching us. So such stories end up having intoxicating "hype" effects and drive fundraising, and so we get where we are.
For one, I wish people were just excited about making computers do things they couldn't do before, without needing to dress it up as something more than it is. "This model can prove a set of theorems in this format with such and such limits and efficiency"
Agreed. If someone believes the world is purely mechanistic, then it follows that a sufficiently large computing machine can model the world---like Leibniz's Ratiocinator. The intoxication may stem from the potential for predictability and control.
The irony is: why would someone want control if they don't have true choice? Unfortunately, such a question rarely pierces the intoxicated mind when this mind is preoccupied with pass the class, get an A, get a job, buy a house, raise funds, sell the product, win clients, gain status, eat right, exercise, check insta, watch the game, binge the show, post on Reddit, etc.
I'm with you. Interpreting a problem as a problem requires a human (1) to recognize the problem and (2) to convince other humans that it's a problem worth solving. Both involve value, and value has no computational or mechanistic description (other than "given" or "illusion"). Once humans have identified a problem, they might employ a tool to find the solution. The tool has no sense that the problem is important or even hard; such values are imposed by the tool's users.
It's worth considering why "everyone seems all too ready to make ... leaps ..." "Neural", "intelligence", "learning", and others are metaphors that have performed very well as marketing slogans. Behind the marketing slogans are deep-pocketed, platformed corporate and government (i.e. socio-rational collective) interests. Educational institutions (another socio-rational collective) and their leaders have on the whole postured as trainers and preparers for the "real world" (i.e. a job), which means they accept, support, and promote the corporate narratives about techno-utopia. Which institutions are left to check the narratives? Who has time to ask questions given the need to learn all the technobabble (by paying hundreds of thousands for 120 university credits) to become a competitive job candidate?
I've found there are many voices speaking against the hype---indeed, even (rightly) questioning the epistemic underpinnings of AI. But they're ignored and out-shouted by tech marketing, fundraising politicians, and engagement-driven media.
> there is no reason to believe a connection between the mechanical model and what happens in organisms has been established
The universal approximation theorem. And that's basically it. The rest is empirical.
No matter which physical processes happen inside the human brain, a sufficiently large neural network can approximate them. Barring unknowns like super-Turing computational processes in the brain.
The universal approximation theorem is set in a precise mathematical context; I encourage you to limit its applicability to that context despite the marketing label "universal" (which it isn't). Consider your concession about empiricism. There's no empirical way to prove (i.e. there's no experiment that can demonstrate beyond doubt) that all brain or other organic processes are deterministic and can be represented completely as functions.
Function is the most general way of describing relations. Non-deterministic processes can be represented as functions with a probability distribution codomain. Physics seems to require only continuous functions.
Sorry, but there's not much evidence that can support human exceptionalism.
Some differential equations that model physics admit singularities and multiple solutions. Therefore, functions are not the most general way of describing relations. Functions are a subset of relations.
Although "non-deterministic" and "stochastic" are often used interchangeably, they are not equivalent. Probability is applied analysis whose objects are distributions. Analysis is a form of deductive, i.e. mechanical, reasoning. Therefore, it's more accurate (philosophically) to identify mathematical probability with determinism. Probability is a model for our experience. That doesn't mean our experience is truly probabilistic.
Humans aren't exceptional. Math modeling and reasoning are human activities.
That's not useful by itself, because "anything can model anything else" doesn't put any upper bound on emulation cost, which for one small task could be larger than the total energy available in the entire Universe.
Either the brain violates the physical Church-Turing thesis or it doesn't.
If it does, well, it will take more time to incorporate those physical mechanisms into computers to get them on par with the brain.
I leave the possibility that it's "magic"[1] aside. It's just impossible to predict, because it will violate everything we know about our physical world.
[1] One example of "magic": we live in a simulation and the brain is not fully simulated by the physics engine, but creators of the simulation for some reason gave it access to computational resources that are impossible to harness using the standard physics of the simulated world. Another example: interactionistic soul.
I mean, that is why they mention super-Turing processes like quantum-based computing.
Quantum computing actually isn't super-Turing, it "just" computes some things faster. (Strictly speaking it's somewhere between a standard Turing machine and a nondeterministic Turing machine in speed, and the first can emulate the second.)
I hear these arguments a lot from law and philosophy students, never from those trained in mathematics. It seems to me "literary" people will still be discussing these theoretical hypotheticals as the technology passes them by, built by others.
I straddle both worlds. Consider that using the lens of mathematical reasoning to understand everything is a bit like trying to use a single mathematical theory (eg that of groups) to comprehend mathematics as a whole. You will almost always benefit and enrich your own understanding by daring to incorporate outside perspectives.
Consider also that even as digital technology and the ratio-mathematical understanding of the world have advanced, they are still rife with dynamics and problems that require a humanistic approach. In particular, a mathematical conception cannot resolve teleological problems which require the establishment of consensus and the actual determination of what we, as a species, want the world to look like. Climate change and general economic imbalance are already evidence of the kind of disasters that mount when you limit yourself to a reductionistic, overly mathematical and technological understanding of life and existence. Being is not a solely technical problem.
Can you define what you mean by novel here?
I don't have much to opine from an advanced maths perspective, but I'd like to point out a couple examples of where ChatGPT made basic errors in questions I asked it as an undergrad CS student.
1. I asked it to show me the derivation of a formula for the efficiency of Stop-and-Wait ARQ and it seemed to do it, but a day later, I realised that in one of the steps, it just made a term vanish to get to the next step. Obviously, I should have verified more carefully, but when I asked it to spot the mistake in that step, it did the same thing twice more with bs explanations of how the term is absorbed.
2. I asked it to provide me syllogisms that I could practice proving. An overwhelming number of the syllogisms it gave me were inconsistent and did not hold. This surprised me more because syllogisms are about the most structured arguments you can find, having been formalized centuries ago and discussed extensively since then. In this case, asking it to walk step-by-step actually fixed the issue.
Both of these were done on the free plan of ChatGPT, but I can't remember if it was 4o or 4.
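For what it's worth, the Stop-and-Wait derivation in point 1 ends at the standard textbook result U = 1/(1 + 2a), where a is the ratio of one-way propagation delay to frame transmission time, so a dropped term is easy to catch numerically (a sketch; the variable names are mine):

    def stop_and_wait_utilisation(frame_bits: float, bitrate_bps: float, prop_delay_s: float) -> float:
        t_frame = frame_bits / bitrate_bps   # time to clock one frame onto the link
        a = prop_delay_s / t_frame           # normalised propagation delay
        return 1.0 / (1.0 + 2.0 * a)         # sender idles for 2 * prop_delay per frame (ACK time ignored)

    # e.g. 1000-bit frames, 1 Mbps link, 5 ms one-way propagation delay
    print(stop_and_wait_utilisation(1000, 1e6, 5e-3))   # ~0.091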
It's fascinating that this has run into the exact same problem as quantum research. I.e., in the quantum research, to demonstrate any valuable forward progress you must compute something that is impossible to do with a traditional computer. If you can't do it with a traditional computer, it suddenly becomes difficult to verify correctness (i.e., you can't just check that it matches the traditional computer's answer).
In the same way, ChatGPT scores 25% on this and the question is "How close were those 25% to questions in the training set?" Or to put it another way, we want to answer the question "Is ChatGPT getting better at applying its reasoning to out-of-set problems, or is it pulling more data into its training set?" Or "Is the test leaking into the training?"
Maybe the whole question is academic and it doesn't matter, we solve the entire problem by pulling all human knowledge into the training set and that's a massive benefit. But maybe it implies a limit to how far it can push human knowledge forward.
>in the quantum research to demonstrate any valuable forward progress you must compute something that is impossible to do with a traditional computer
This is factually wrong. The most interesting problems motivating the quantum computing research are hard to solve, but easy to verify on classical computers. The factorization problem is the most classical example.
The problem is that existing quantum computers are not powerful enough to solve the interesting problems, so researchers have to invent semi-artificial problems to demonstrate "quantum advantage" to keep the funding flowing.
There is a plethora of opportunities for LLMs to show their worth. For example, finding interesting links between different areas of research or being a proof assistant in a math/programming formal verification system. There is a lot of ongoing work in this area, but at the moment signal-to-noise ratio of such tools is too low for them to be practical.
No, it is factually right, at least if Scott Aaronson is to be believed:
> Having said that, the biggest caveat to the “10^25 years” result is one to which I fear Google drew insufficient attention. Namely, for the exact same reason why (as far as anyone knows) this quantum computation would take ~10^25 years for a classical computer to simulate, it would also take ~10^25 years for a classical computer to directly verify the quantum computer’s results!! (For example, by computing the “Linear Cross-Entropy” score of the outputs.) For this reason, all validation of Google’s new supremacy experiment is indirect, based on extrapolations from smaller circuits, ones for which a classical computer can feasibly check the results. To be clear, I personally see no reason to doubt those extrapolations. But for anyone who wonders why I’ve been obsessing for years about the need to design efficiently verifiable near-term quantum supremacy experiments: well, this is why! We’re now deeply into the unverifiable regime that I warned about.
It's a property of the "semi-artificial" problem chosen by Google. If anything, it means that we should heavily discount this claim of "quantum advantage", especially in the light of inherent probabilistic nature of quantum computations.
Note that the OP wrote "you MUST compute something that is impossible to do with a traditional computer". I demonstrated a simple counter-example to this statement: you CAN demonstrate forward progress by factorizing big numbers, but the problem is that no one can do it despite billions of investments.
Apparently they can't, right now, as you admit. Anyway this is turning into a stupid semantic argument, have a nice day.
If they can't, then is it really quantum supremacy?
They claimed it last time in 2019 with Sycamore, which could perform in 200 seconds a calculation that Google claimed would take a classical supercomputer 10,000 years.
That was debunked when a team of scientists replicated the same thing on an ordinary computer in 15 hours with a large number of GPUs. Scott Aaronson said that on a supercomputer, the same technique would have solved the problem in seconds.[1]
So if they now come up with another problem which they say cannot even be verified by a classical computer and uses it to claim quantum advantage, then it is right to be suspicious of that claim.
1. https://www.science.org/content/article/ordinary-computers-c...
the unverifiable regime is a great way to extract funding.
> This is factually wrong.
What's factually wrong about it? OP said "you must compute something that is impossible to do with a traditional computer" which is true, regardless of the output produced. Verifying an output is very different from verifying the proper execution of a program. The difference between testing a program and seeing its code.
What is being computed is fundamentally different from classical computers, therefore the verification methods of proper adherence to instructions becomes increasingly complex.
They left out the key part, which was incorrect, and the sentence right after: "If you can't do it with a traditional computer, it suddenly becomes difficult to verify correctness"
The point stands that for actually interesting problems, verifying correctness of the results is trivial. I don't know if "adherence to instructions" translates at all to quantum computing.
> This is factually wrong. The most interesting problems motivating the quantum computing research are hard to solve, but easy to verify on classical computers.
Your parent did not talk about quantum computers. I guess he rather had predictions of novel quantum field theories or theories of quantum gravity in the back of his mind.
Then his comment makes even less sense.
I agree with the issue of ”is the test dataset leaking into the training dataset” being an issue with interpreting LLM capabilities in novel contexts, but not sure I follow what you mean on the quantum computing front.
My understanding is that many problems have solutions that are easier to verify than to solve using classical computing. e.g. prime factorization
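For example (a small sketch; the primes are just well-known ones used for illustration):

    p, q = 1_299_709, 15_485_863      # the 100,000th and 1,000,000th primes
    n = p * q

    # verification: one division, instant at any size
    assert n % p == 0 and n // p == q

    # search: trial division, already ~1.3 million steps here and hopeless at cryptographic sizes
    def smallest_factor(m: int) -> int:
        d = 2
        while d * d <= m:
            if m % d == 0:
                return d
            d += 1
        return m

    print(smallest_factor(n) == p)    # True, but only after iterating up to p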
Oh it's a totally different issue on the quantum side that leads to the same issue with difficulty verifying. There, the algorithms that Google for example is using today, aren't like prime factorization, they're not easy to directly verify with traditional computers, so as far as I'm aware they kind of check the result for a suitably small run, and then do the performance metrics on a large run that they hope gave a correct answer but aren't able to directly verify.
How much of this could be resolved if its training set were reduced? Conceivably, most of the training serves only to confuse the model when only aiming to solve a math equation.
If constrained by existing human knowledge to come up with an answer, won’t it fundamentally be unable to push human knowledge forward?
Depends on your understanding of human knowledge I guess? People talk about the frontier of human knowledge and if your view of knowledge is like that of a unique human genius pushing forward the frontier then yes - it'd be stuck. But if you think of knowledge as more complex than that you could have areas that are kind of within our frontier of knowledge (that we could reasonably know, but don't actually know) - taking concepts that we already know in one field and applying them to some other field. Today the reason that doesn't happen is because genius A in physics doesn't know about the existence of genius B in mathematics (let alone understand their research), but if it's all imbibed by "The Model" then it's trivial to make that discovery.
I was referring specifically to the parent comments statements around current AI systems.
There are likely lots of connections that could be made that no individual has made because no individual has all of existing human knowledge at their immediate disposal.
Reasoning is essentially the creation of new knowledge from existing knowledge. The better the model can reason the less constrained it is to existing knowledge.
The challenge is how to figure out if a model is genuinely reasoning
Reasoning is a very minor (but essential) part of knowledge creation.
Knowledge creation comes from collecting data from the real world, and cleaning it up somehow, and brainstorming creative models to explain it.
NN/LLM's version of model building is frustrating because it is quite good, but not highly "explainable". Human models have higher explainability, while machine models have high predictive value on test examples due to an impenetrable mountain of algebra.
Then much of human research and development is also fundamentally impossible.
Only if you think current "AI" is on the same level as human creativity and intelligence, which it clearly is not.
I think current "AI" (i.e. LLMs) is unable to push human knowledge forward, but not because it's constrained by existing human knowledge. It's more like peeking into a very large magic-8 ball, new answers everytime you shake it. Some useful.
It may be able to push human knowledge forward to an extent.
In the past, there was quite a bit of low hanging fruit such that you could have polymaths able to contribute to a wide variety of fields, such as Newton.
But in the past 100 years or so, the problem is there is so much known that it is impossible for any single person to have deep knowledge of everything. E.g. it's rare to find a really good mathematician who also has deep knowledge (beyond intro courses) of, say, chemistry.
Would a sufficiently powerful AI / ML model be able to come up with this synthesis across fields?
That's not a strong reason. Yes, that means ChatGPT isn't good at wholly independently pushing knowledge forward, but a good brainstormer that is right even 10% of the time is an incredible fount of knowledge.
I don't think many expect AI to push knowledge forward? A thing that basically just regurgitates consensus historic knowledge seems badly suited to that
But apparently these new frontier models can 'reason' - so with that logic, they should be able to generate new knowledge?
O1 was able to find the math problem in a recently published paper, so yes.
I didn't see anyone else ask this, but... isn't the FrontierMath dataset compromised now? At the very least OpenAI now knows the questions, if not the answers. I would expect that the next iteration will "magically" get over 80% on the FrontierMath test. I imagine that experiment was pretty closely monitored.
I figured their model was independently evaluated against the questions/answers. That's not to say it's not compromised by "Here's a bag of money" type methods, but I don't even think it'd be a reasonable test if they just handed over the dataset.
I'm sure it was independently evaluated, but I'm sure the folks running the test were not given an on-prem installation of ChatGPT to mess with. It was still done via API calls, presumably through the chat interface UI.
That means the questions went over the fence to OpenAI.
I'm quite certain they are aware of that, and it would be pretty foolish not to take advantage of at least knowing what the questions are.
Now that you put it that way, it is laughably easy.
This was my first thought when I saw the results:
Insightful comment. The thing that's extremely frustrating is look at all the energy poured into this conversation around benchmarks. There is a fundamental assumption of honesty and integrity in the benchmarking process by at least some people. But when the dataset is compromised and generation N+1 has miraculous performance gains, how can we see this as anything other than a ploy to pump up valuations? Some people have millions of dollars at stake here and they don't care about the naysayers in the peanut gallery like us.
It's sadly inevitable that when billions in funding and industry hype are tied to performance on a handful of benchmarks, scores will somehow, magically, continue to go up.
Needless to say, it doesn't bring us any closer to AGI.
The only solution I see here is people crafting their own, private benchmarks that the big players don't care about enough to train on. That, at least, gives you a clearer view of the field.
Not sure why your comment was downvoted, but it certainly shows the pressure going against people who point out fundamental flaws. This is pushing us towards "AVI" rather than AGI: "Artificially Valued Intelligence". The optimization function here is around the market.
I'm being completely serious. You are correct, despite the downvotes, that this could not be pushing us towards AGI because if the dataset is leaked you can't claim the G-- generalizability.
The point of the benchmark is to lead us to believe that this is a substantial breakthrough. But a reasonable person would be forced to conclude that the results are misleading due to optimizing around the training data.
I think this is a silly question; you could track AIs doing very simple maths back to the 1960s-1970s.
It's just the worrisome linguistic confusion between AI and LLMs.
> I am dreading the inevitable onslaught in a year or two of language model “proofs” of the Riemann hypothesis which will just contain claims which are vague or inaccurate in the middle of 10 pages of correct mathematics which the human will have to wade through to find the line which doesn’t hold up.
I wonder what the response of working mathematicians will be to this. If the proofs look credible it might be too tempting to try to validate them, but if there's a deluge that could be a huge time sink. Imagine if Wiles or Perelman had produced a thousand different proofs for their respective problems.
Maybe the coming onslaught of AI slop "proofs" will give a little bump to proof assistants like Coq. Of course, it would still take a human mathematician some time to verify theorem definitions.
Don't waste time on looking at it unless a formal proof checker can verify it.
> As an academic mathematician who spent their entire life collaborating openly on research problems and sharing my ideas with other people, it frustrates me [that] I am not even [able] to give you a coherent description of some basic facts about this dataset, for example, its size. However there is a good reason for the secrecy. Language models train on large databases of knowledge, so [the] moment you make a database of maths questions public, the language models will train on it.
Well, yes and no. This is only true because we are talking about closed models from closed companies like so-called "OpenAI".
But if all models were truly open, then we could simply verify what they had been trained on, and make experiments with models that we could be sure had never seen the dataset.
Decades ago Microsoft (in the words of Ballmer and Gates) famously accused open source of being a "cancer" because of the cascading nature of the GPL.
But it's the opposite. In software, and in knowledge in general, the true disease is secrecy.
> But if all models were truly open, then we could simply verify what they had been trained on
How do you verify what a particular open model was trained on if you haven’t trained it yourself? Typically, for open models, you only get the architecture and the trained weights. How can you reliably verify what the model was trained on from this?
Even if they provide the training set (which is not typically the case), you still have to take their word for it—that’s not really "verification."
The OP said "truly open" not "open model" or any of the other BS out there. If you are truly open you share the training corpora as well or at least a comprehensive description of what it is and where to get it.
It seems like you skipped the second paragraph of my comment?
So here's what I'm perplexed about. There are statements in Presburger arithmetic that take time doubly exponential (or worse) in the size of the statement to reach via any path of the formal system whatsoever. These are arithmetic truths about the natural numbers. Can these statements be reached faster in ZFC? Possibly—it's well-known that there exist shorter proofs of true statements in more powerful consistent systems.
But the problem then is that one can suppose there are also true short statements in ZFC which likewise require doubly exponential time to reach via any path. Presburger Arithmetic is decidable whereas ZFC is not, so these statements would require the additional axioms of ZFC for shorter proofs, but I think it's safe to assume such statements exist.
Now let's suppose an AI model can resolve the truth of these short statements quickly. That means one of three things:
1) The AI model can discover doubly exponential length proof paths within the framework of ZFC.
2) There are certain short statements in the formal language of ZFC that the AI model cannot discover the truth of.
3) The AI model operates outside of ZFC to find the truth of statements in the framework of some other, potentially unknown formal system (and for arithmetical statements, the system must necessarily be sound).
How likely are each of these outcomes?
1) is not possible within any coherent, human-scale timeframe.
2) IMO is the most likely outcome, but then this means there are some really interesting things in mathematics that AI cannot discover. Perhaps the same set of things that humans find interesting. Once we have exhausted the theorems with short proofs in ZFC, there will still be an infinite number of short and interesting statements that we cannot resolve.
3) This would be the most bizarre outcome of all. If AI operates in a consistent way outside the framework of ZFC, then that would be equivalent to solving the halting problem for certain (infinite) sets of Turing machine configurations that ZFC cannot solve. That in itself isn't too strange (e.g., it might turn out that ZFC lacks an axiom necessary to prove something as simple as the Collatz conjecture), but what would be strange is that it could find these new formal systems efficiently. In other words, it would have discovered an algorithmic way to procure new axioms that lead to efficient proofs of true arithmetic statements. One could also view that as an efficient algorithm for computing BB(n), which obviously we think isn't possible. See Levin's papers on the feasibility of extending PA in a way that leads to quickly discovering more of the halting sequence.
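For reference, the doubly exponential claim above is, as I recall, the Fischer-Rabin lower bound, roughly:

    T(n) \ge 2^{2^{cn}} \quad \text{for some fixed } c > 0 \text{ and infinitely many statement lengths } n,

i.e. no decision procedure for Presburger arithmetic, however clever, beats doubly exponential time in the worst case.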
ZFC is way worse than Presburger arithmetic -- since it is undecidable, we know that the length of the minimal proof of a statement cannot be bounded by a computable function of the length of the statement.
This has little to do with the usefulness of LLMs for research-level mathematics though. I do not think that anyone is hoping to get a decision procedure out of it, but rather something that would imitate human reasoning, which is heavily based on analogies ("we want to solve this problem, which shares some similarities with that other solved problem, can we apply the same proof strategy? if not, can we generalise the strategy so that it becomes applicable?").
2 is definitely true. 3 is much more interesting and likely true but even saying it takes us into deep philosophical waters.
If every true theorem had a proof of computationally bounded length, the halting problem would be solvable. So the AI can't find some of those proofs.
The reason I say 3 is deep is that ultimately our foundational reasons to assume ZFC plus the bits we need for logic come from philosophical groundings, and not everyone accepts the same ones. Ultrafinitists and large cardinal theorists are both kinds of people I've met.
> There are statements in Presburger arithmetic that take time doubly exponential (or worse) in the size of the statement to reach via any path of the formal system whatsoever.
This is a correct statement about the worst case runtime. What is interesting for practical applications is whether such statements are among those that you are practically interested in.
I would certainly think so. The statements mathematicians seem to be interested in tend to be at a "higher level" than simple but true statements like 2+3=5. And they necessarily have a short description in the formal language of ZFC, otherwise we couldn't write them down (e.g., Fermat's last theorem).
If the truth of these higher level statements instantly unlocks many other truths, then it makes sense to think of them in the same way that knowing BB(5) allows one to instantly classify any Turing machine configuration on the computation graph of all n ≤ 5 state Turing machines (on empty tape input) as halting/non-halting.
I am fairly optimistic about LLMs as a human math -> theorem-prover translator, and as a fan of Idris I am glad that the AI community is investing in Lean. As the author shows, the answer to "Can AI be useful for automated mathematical work?" is clearly "yes."
But I am confident the answer to the question in the headline is "no, not for several decades." It's not just the underwhelming benchmark results discussed in the post, or the general concern about hard undergraduate math using different skillsets than ordinary research math. IMO the deeper problem still seems to be a basic gap where LLMs can seemingly do formal math at the level of a smart graduate student but fail at quantitative/geometric reasoning problems designed for fish. I suspect this holds for O3, based on one of the ARC problems it wasn't able to solve: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr... (via https://www.interconnects.ai/p/openais-o3-the-2024-finale-of...) ANNs are simply not able to form abstractions, they can only imitate them via enormous amounts of data and compute. I would say there has been zero progress on "common sense" math in computers since the invention of Lisp: we are still faking it with expert systems, even if LLM expert systems are easier to build at scale with raw data.
It is the same old problem where an ANN can attain superhuman performance on level 1 of Breakout, but it has to be retrained for level 2. I am not convinced it makes sense to say AI can do math if AI doesn't understand what "four" means with the same depth as a rat, even if it can solve sophisticated modular arithmetic problems. In human terms, does it make sense to say a straightedge-and-compass AI understands Euclidean geometry if it's not capable of understanding the physical intuition behind Euclid's axioms? It makes more sense to say it's a brainless tool that helps with the tedium and drudgery of actually proving things in mathematics.
Just a comment: the example o1 got wrong was actually underspecified: https://anokas.substack.com/p/o3-and-arc-agi-the-unsolved-ta...
Which is actually a problem I have with ARC (and IQ tests more generally): it is computationally cheaper to go from ARC transformation rule -> ARC problem than it is the other way around. But this means it’s pretty easy to generate ARC problems with non-unique solutions.
To give a sense of scale: it's not that o3 failed to solve that red/blue rectangle problem once; o3 spent thousands of GPU hours putting out text about that problem, creating by my math about a million pages of text, and did not find the answer anywhere in those pages. For other problems it did find the answer around the million-page mark, as at the ~$3000-per-problem spend setting the score was still slowly creeping up.
If the trajectory of the past two years is any guide, things that can be done at great compute expense now will rapidly become possible for a fraction of the cost.
The trajectory is not a guide, unless you count the recent plateauing.
it can take my math and point out a step I missed and then show me the correct procedure but still get the wrong result because it can't reliably multiply 2-digit numbers
Better than an average human then.
Different than an average human.
Every profession seems to have a pessimistic view of AI as soon as it starts to make progress in their domain. Denial, Anger, Bargaining, Depression, and Acceptance. Artists seem to be in the depression state, many programmers are still in the denial phase. Pretty solid denial here from a mathematician. o3 was a proof of concept, like every other domain AI enters, it's going to keep getting better.
Society is CLEARLY not ready for what AI's impact is going to be. We've been through change before, but never at this scale and speed. I think Musk/Vivek's DOGE thing is important; our government has gotten quite large and bureaucratic. But the clock has started on AI, and this is a social structural issue we've gotta figure out. Putting it off means we probably become subjects to a default set of rulers, if not the shoggoth itself.
Or is it just white collar workers experiencing what blue collar workers have been experiencing for decades?
So will that make society shift to the left in demand of stronger safety nets, or to the right in search of a strongman to rescue them?
The reason why this is so disruptive is that it will affect hundreds of fields simultaneously.
Previously workers in a field disrupted by automation would retrain to a different part of the economy.
If AI pans out to the point that there are mass layoffs in hundreds of sectors of the economy at once, then I'm not sure the process we have haphazardly set up now will work. People will have no idea where to go beyond manual labor. (But this will be difficult due to the obesity crisis - but maybe it will save lives in a weird way).
If there are 'mass layoffs in hundreds of sectors of the economy at once', then the economy immediately goes into Great Depression 2.0 or worse. Consumer spending is two-thirds of the US economy, when everyone loses their jobs and stops having disposable income that's literally what a depression is
This will create a prisoner’s dilemma for corporations then, the government will have to step in to provide incentives for insanely profitable corporations to keep the proper number of people employed or limit the rate of layoffs.
I think it's a little of both. Maybe generative AI algorithms won't overcome their initial limitations. But maybe we don't need to overcome them to transform society in a very significant way.
My favourite moments of being a graduate student in math was showing my friends (and sometimes professors) proofs of propositions and theorems that we discussed together. To be the first to put together a coherent piece of reasoning that would convince them of the truth was immensely exciting. Those were great bonding moments amongst colleagues. The very fact that we needed each other to figure out the basics of the subject was part of what made the journey so great.
Now, all of that will be done by AI.
Reminds of the time when I finally enabled invincibility in Goldeneye 007. Rather boring.
I think we've stopped appreciating the human struggle and experience and have placed all the value on the end product, and that's why we're developing AI so much.
Yeah, there is the possibility of working with an AI but at that point, what is the point? Seems rather pointless to me in an art like mathematics.
> How much longer this will go on for nobody knows, but there are lots of people pouring lots of money into this game so it would be a fool who bets on progress slowing down any time soon.
Money cannot solve the issues faced by the industry, which mainly revolve around the lack of training data.
They already used the entirety of the internet, all available video, audio and books and they are now dealing with the fact that most content online is now generated by these models, thus making it useless as training data.
Who is the author?
Kevin Buzzard
At this stage I assume everything having a sequential pattern can and will be automated by LLM AIs.
I think that’s provably incorrect for the current approach to LLMs. They all have a horizon over which they correlate tokens in the input stream.
So, for any LLM, if you intersperse more than that number of ‘X’ tokens between each useful token, they won’t be able to do anything resembling intelligence.
The current LLMs are a bit like n-gram databases that do not use letters, but larger units.
Isn't that a bit of an unfair sabotage?
Naturally, humans couldn't do it either, though they could edit the input to remove the X's. But shouldn't we evaluate the ability (even intelligent ability) of LLMs on what they can generally do rather than amplifying their weaknesses?
The follow-up question is "Does it require a paradigm shift to solve it?". And the answer could be "No". Episodic memory, hierarchical learnable tokenization, online learning or whatever works well on GPUs.
At this stage I hope everything that needs to be reliable won't be automated by LLM AIs.
How to train an AI strapped to a formal solver.
> FrontierMath is a secret dataset of “hundreds” of hard maths questions, curated by Epoch AI, and announced last month.
The database stopped being secret when it was fed to proprietary LLMs running in the cloud. If anyone is not thinking that OpenAI has trained and tuned O3 on the "secret" problems people fed to GPT-4o, I have a bridge to sell you.
This level of conspiracy thinking requires evidence to be useful.
Edit: I do see from your profile that you are a real person though, so I say this with more respect.
What evidence do we need that AI companies are exploiting every bit of information they can use to get ahead in the benchmarks to generate more hype? Ignoring terms/agreements, violating copyright, and otherwise exploiting information for personal gain is the foundation of that entire industry for crying out loud.
When did we decide that AI == LLM? Oh don't answer. I know, The VC world noticed CNNs and LLMs about 10 years ago and it's the only thing anyone's talked about ever since.
Seems to me the answer to 'Can AI do maths yet?' depends on what you call AI and what you call maths. Our old departmental VAX running at a handful of megahertz could do some very clever symbol manipulation on binomials, and if you gave it a few seconds, it could even do something like theorem proving via proto-Prolog. Neither are anywhere close to the glorious AGI future we hope to sell to industry and government, but it seems worth considering how they're different, why they worked, and whether there's room for some hybrid approach. Do LLMs need to know how to do math if they know how to write Prolog or Coq statements that can do interesting things?
I've heard people say they want to build software that emulates (simulates?) how humans do arithmetic, but ask a human to add anything bigger than two digit numbers and the first thing they do is reach for a calculator.
No it can't, and there's no such thing as AI. How is a thing that predicts the next-most-likely word going to do novel math? It can't even do existing math reliably because logical operations and statistical approximation are fundamentally different. It is fun watching grifters put lipstick on this thing and shop it around as a magic pig though.
Betteridge's Law applies.
[dead]
[dead]
"once" the training data can do it, LLMs will be able to do it. and AI will be able to do math once it comes to check out the lights of our day and night. until then it'll probably wonder continuously and contiguously: "wtf! permanence! why?! how?! by my guts, it actually fucking works! why?! how?!"
I do think it is time to start questioning whether the utility of AI can be reduced solely to the quality of the training data.
This might be a dogma that needs to die.
If not, bad training data shouldn't be a problem.
There can be more than one problem. The history of computing (or even just the history of AI) is full of things that worked better and better right until they hit a wall. We get diminishing returns adding more and more training data. It’s really not hard to imagine a series of breakthroughs bringing us way ahead of LLMs.
I tried. I don't have the time to formulate and scrutinise adequate arguments, though.
Do you? Anything anywhere you could point me to?
The algorithms live entirely off the training data. They consistently fail to "abduct" (make inferences) beyond any information specific to the language in/of the training data.
The best way to predict the next word is to accurately model the underlying system that is being described.
It is a gradual thing. Presumably the models are inferring things at runtime that were not part of their training data.
Anyhow, philosophically speaking you are also only exposed to what your senses pick up, but presumably you are able to infer things?
As written: this is a dogma that stems from a limited understanding of what algorithmic processes are and the insistence that emergence cannot happen from algorithmic systems.
AWS announced, 2 or 3 weeks ago, a way of formulating rules in a formal language.
AI doesn't need to learn everything; our LLMs already contain EVERYTHING, including ways of how to find a solution step by step.
Which means you can tell an LLM to translate whatever you want into a logical language and use an external logic verifier. The only thing an LLM or AI needs to 'understand' at this point is how to make sure that the statistical quality of the translation from one form to the other is high enough.
Your brain doesn't just do logic out of the box; you conclude things and formulate them.
And plenty of companies work on this. It's the same with programming: if you are able to write code and execute it, you execute it until the compiler errors are gone. Now your LLM can write valid code out of the box. Let the LLM write unit tests; now it can verify itself.
Claude for example offers you, out of the box, to write a validation script. You can give claude back the output of the script claude suggested to you.
Don't underestimate LLMs
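A minimal sketch of that translate-then-verify loop, using the Z3 SMT solver as the external checker (Z3 is my example here, not the AWS service mentioned above):

    from z3 import Int, Solver, Implies, Not, unsat

    x = Int("x")
    claim = Implies(x + 2 == 4, x == 2)   # the formal statement an LLM might emit

    s = Solver()
    s.add(Not(claim))                      # the claim is valid iff its negation is unsatisfiable
    print("verified" if s.check() == unsat else f"refuted, counterexample: {s.model()}")

The LLM only has to get the translation into the formal language right; the solver, not the model, does the actual checking.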
Is this the AWS thing you referenced? https://aws.amazon.com/what-is/automated-reasoning/
As far as ChatGPT goes, you may as well be asking: Can AI use a calculator?
The answer is yes, it can utilize a stateful python environment and solve complex mathematical equations with ease.
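For example, the tool call behind such an answer is usually just ordinary SymPy (a sketch, not an actual ChatGPT transcript):

    from sympy import symbols, Eq, solve

    x = symbols("x")
    print(solve(Eq(x**2 - 5*x + 6, 0), x))   # [2, 3]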
There is a difference between correctly stating that 2 + 2 = 4 within a set of logical rules and proving that 2 + 2 = 4 must be true given the rules.
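In Lean the distinction shows up in miniature: both lines below are accepted, but each is a checked derivation from the rules rather than a calculator's output (a sketch, Lean 4 core, no imports needed):

    example : 2 + 2 = 4 := rfl        -- holds by definitional computation of Nat addition
    example : 2 + 2 = 4 := by decide  -- or via a decision procedure, still producing a proof term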
I think you misunderstood, ChatGPT can utilize Python to solve a mathematical equation and provide proof.
https://chatgpt.com/share/676980cb-d77c-8011-b469-4853647f98...
More advanced solutions:
https://chatgpt.com/share/6769895d-7ef8-8011-8171-6e84f33103...
It still has to know what to code in that environment. And based on my years of math as a wee little undergrad, the actual arithmetic was the least interesting part. LLMs are horrible at basic arithmetic, but they can use Python as the calculator. But Python won't help them write the correct equations or even solve for the right thing (Wolfram Alpha can do a bit of that, though).
You’ll have to show me what you mean.
I’ve yet to encounter an equation that 4o couldn’t answer in 1-2 prompts unless it timed out. Even then it can provide the solution in a Jupyter notebook that can be run locally.
I can't reliably multiply four digit numbers in my head either, what's your point?
Nobody said you have to do it in your head.
That's equivalent to what we are asking the model to do. If you give the model a calculator it will get 100%. If you give it pen and paper (e.g. let it show its working) then it will get near 100%.
AI has an interior world model, thus it can do math if a chain of proof is walked without uncertainty from room to room. The problem is its inability to reflect on its own uncertainty and to then override that uncertainty, should a new room-entrance method be self-similar to a previous entrance.
I may be wrong, but I think it a silly question. AI is basically auto-complete. It can do math to the extent you can find a solution via auto-complete based on an existing corpus of text.
You're underestimating the emergent behaviour of these LLMs. See for example what Terence Tao thinks about o1:
I'm always just so pleased that the most famous mathematician alive today is also an extremely kind human being. That has often not been the case.
Pretty sure this is out of date now
[flagged]
Why would others provide proofs when you are yourself posting groundless opinions as facts in this very thread?
> AI is basically
Very many things have conventionally been labelled that way since the '50s.
You are speaking of LLMs.
Yes - I mean only to say "AI" as the term is commonly used today.
Humans can autocomplete sentences too because we understand what's going on. Prediction is a necessary criterion for intelligence, not an irrelevant one.
I haven't checked in a while, but last I checked, ChatGPT struggled on very basic things like "how many Fs are in this word?" Not sure if they've managed to fix that, but since then I have lost hope in getting it to do any sort of math.
I understand the appeal of having a machine helping us with maths and expanding the frontier of knowledge. They can assist researchers and make them more productive, just like they can already make programmers more productive.
But maths is also a fun and fulfilling activity. Very often, when we learn a mathematical theory, it's because we want to understand and gain intuition about the concepts, or we want to solve a puzzle (for which we could already look up the solution). Maybe it's similar to chess. We didn't develop chess engines to replace human players and make them play each other; they helped us become better chess players and understand the game better.
So the recent progress is impressive, but I still don't see how we'll use this tech practically and what impacts it can have and in which fields.
I wish scientists who do psychology and cognition of actual brains could approach those AI things and talk about it, and maybe make suggestions.
I really really wish AI would make some breakthrough and be really useful, but I am so skeptical and negative about it.
Unfortunately, the scientists who study actual brains have all sorts of interesting models but ultimately very little clue how these actual brains work at the level of problem solving. I mean, there's all sorts of "this area is associated with that kind of process" and "here's evidence this area does this algorithm" stuff, but it's all at the level of steam-engine engineers trying to understand a warp drive.
The "open worm project" was an effort years ago to get computer scientists involved in trying to understand what "software" a very small actual brain could run. I believe progress here has been very slow and that an idea of ignorance that much larger brains involve.
If you can't find useful things for LLMs or AI at this point, you must just lack imagination.