There are two kinds of computing: precision computing and probabilistic computing. For example, cryptography falls into precision computing; there is no room for being incorrect by even a single bit. Whereas machine learning is about getting a range of answers, with tolerance for error.
I like to visualize them as cuts and spans in a continuum, such as a number line. They make up the full picture. One exists only because of the other. One can't do the job of the other and one is defined only in terms of the other.
Banks wouldn't use AI to compute the account balance after a transaction or for authenticating a customer. Network software wouldn't use AI for encryption and decryption of TLS traffic. On the other hand, banks wouldn't mind an x% error in the computation of a credit rating, fraud detection or industry trends analysis.
Writing code is a probabilistic task with many possible variations, while the work done by the code at runtime is, in most cases, a precision task.
Would love to see an architecture that learned more like humans. Start with just imitating one letter, then a few more, then some syllables, then full words, then sentences, etc., progressively adding on top of previous knowledge.
Also, it’s interesting that one of the big goals/measures of models is their capacity to “generalize”, but the training methods optimize for loss/accuracy, and only after training do we test for generalization to validate it.
Are there training methods/curriculums that explicitly maximize generalization?
Would like to see a car that moved like a horse.
Technically, an internal combustion engine has pistons moving like horse legs.
yeah me too that would be fucking awesome, are you kidding?
Yes, I also wonder about this! Progress from children books to scientific papers etc. Could it learn e.g. language structure faster in a pre-training stage? Also somehow one needs to define a proxy to generalization to compute a loss and do backpropagation.
This field of study is known as "Curriculum Learning" for your Googling pleasure (or I guess ChatGPT Deep Research now).
Probably don’t need the name of the field for ChatGPT to get it.
There's an interesting question here.
Would a single human/entity learn more in, say, three million years, or would short-lived ones evolving over three million years and then getting ~20 years of education learn more?
The current AI tech cycle is focusing on the first, but we don't really know if there are benefits to both approaches.
There's no obvious way to combine these yet.
Isn’t that what all the hundreds of billions are banking on? “General” intelligence.
You don't need general intelligence to make good memes to keep people scrolling through Instagram.
You don't need general intelligence to make a decent coding tool like Cursor.
You don't need general intelligence to improve SERPs.
You don't need general intelligence to sell a subscription for a decent AI assistant.
There's tons of value already added without anything general.
Yes but $500B and counting for memes wasn’t what was sold
I remember reading somewhere someone said "the problem with AI is it's a $50b industry pretending its a $10t industry"
The question is whether, if the models plateau and "AGI" as it was claimed in the beginning never arrives, that will be enough to justify these ongoing multi-hundred-billion-dollar deals.
I mean, probably, LLMs as they are today are already changing the world. But I do think a lot of the ongoing investment is propped up on the promise of another breakthrough that is looking less likely.
"an architecture that learned more like humans"
i.e. enduring countless generations of evolutionary selection and cross breeding, then fine-tuning a bit?
Although it could be interesting, I don't think training on progressively more complex strings entirely recapitulates this.
That’s a very interesting take. I hadn’t really considered evolution
I guess if you really wanted to start from scratch, you could figure out how to evolve the whole system from a single cell or something like that. In some ways neural networks have kind of evolved in that way, assisted by humans. They started with a single perceptron, and have gone all the way to deep learning and convolutional networks
I also remember a long time ago studying genetic and evolutionary algorithms, but they were pretty basic in terms of what they could learn and do, compared to modern LLMs
Although recently I saw some research in which they were applying essentially genetic algorithms to merge model weights and produce models with new/evolved capabilities
"Would love to see an architecture that learned"
Would be a far more accurate statement. Training != Learning.
Do you have an example of an algorithm that learns, rather than is trained/trains itself? I don’t really see the boundary between the two concepts.
If we make some massive physics breakthrough tomorrow, is an LLM going to be able to fully integrate that into its current data set?
Or will we need to produce a host of documents and (re)train a new one in order for the concept to be deeply integrated.
This distinction is subtle but lost on many who think that our current path will get us to AGI...
That isn't to say we haven't created a meaningful tool, but the sooner we get candid and realistic about what it is and how it works, the sooner we can get down to the business of building practical applications with it. (And, as an aside, scaling it, something we aren't doing well with now.)
What is the subtle distinction? I'm "many" and it's not clear at all here. If we had some massive physics breakthrough, the LLM needs to be taught about it, but so do people. Teaching people about it would also involve producing a host of documents in some format, so that's true either way. Training and learning here seem to be opposite ends of the same verb no matter the medium, but I'm open to being enlightened.
Not sure exactly what the parent comment intended, but it does seem to me that it's harder for an LLM to undergo a paradigm shift than for humans. If some new scientific result disproves something that's been stated in a whole bunch of papers, how does the model know that all those old papers are wrong? Do we withhold all those old papers in the next training run, or apply a super heavy weight somehow to the new one, or just throw them all in the hopper and hope for the best?
Given their names I'd say they're too busy optimising primes...
Take your damned upvote, and go away.
Hmm do the winds favor an even/odd cycle of votes..
A while back I saw a post where people ran a model over and over to accomplish a code base port from one language to another.
In their prompt, they told it to leave itself a note and to accomplish something each time.
Then they put the model in a loop and it worked. In one instance, a model removed itself from the loop by editing a file or some other basic means.
To me, iterative tasks like multiplication and long division look an awful lot like the code port experiment.
Putting models into loops so they get more than one bite at the task seems to be a logical progression to improve capability.
The number of paths in the wrong direction is infinitely larger than the number in the right direction. You'll quickly realize this doesn't actually scale.
I'm a bit confused by this; are you referring to vanishing/exploding gradients during training or iteration at inference? If the former, this is only true if you take too many steps. If the latter, we already know this works and scales well.
The latter, and I would disagree that “this works and scales well” in the general sense. It clearly has very finite bounds by the fact we haven’t achieved agi by running an llm in a loop..
The approach of “try a few more things before stopping” is a great strategy akin to taking a few more stabs at RNG. It’s not the same as saying keep trying until you get there - you won’t.
> It clearly has very finite bounds by the fact we haven’t achieved agi by running an llm in a loop..
That's one hell of a criterion. Test-time inference undergoes a similar scaling law to pretraining, and has resulted in dramatically improved performance on many complex tasks. Law of diminishing returns kicks in of course, but this doesn't mean it's ineffective.
> akin to taking a few more stabs at RNG
Assuming I understand you correctly, I disagree. Scaling laws cannot appear with glassy optimisation procedures (essentially iid trials until you succeed, the mental model you seem to be implying here). They only appear if the underlying optimisation is globally connected and roughly convex. It's no different than gradient descent in this regard.
I never made a claim that it's ineffective, just that it's of limited effectiveness. The diminishing returns kick in quickly, and there are more domains where it isn't applicable than domains where it is.
Achieving AGI is not a requirement for working well.
How do you know if you've taken too many steps beforehand?
I am not sure it needs to scale.
The chains-of-thought here are artificially constructed, very information-dense partial sums formatted in a specific way that guides the fine tuning. A potential next step would be to look at real-world chains-of-thought and see whether some process could start with those and achieve the same result. Then you could really have a self-improving system!
Also I wonder if the LLM "knows" that it has this capability after fine-tuning. If it encounters multiplication as part of some larger chain-of-thought, will it solve that internally, or will it continue to do it step-by-step in the chain-of-thought?
But it's very hard to define "real-world CoT" -- think about humans: we learn multiplication by vertical calculation, and we learn division in a similar way -- all of these learning processes require "information dense" tools (the calculation procedure) with intrinsic math rules built in. Isn't that an adapted form of CoT?
Computers are already fast and efficient at multiplication - optimized long ago. Transformers are fast and efficient at working with sequences of tokens. Tools are not universal. A hammer is not a good violin bow. An MRI machine is not a good relational database. This extends to the natural world too. A zebra is not a good dairy animal. And a human poet may or may not be a good surgeon. It’s good to explore what things can do beyond their intrinsic nature - but expect to encounter limits eventually.
I asked a model to tell me what the "long multiplication algorithm" is. It gave it to me. I asked it to follow that algorithm to solve e.g. 12987318927 * 12098102983, and it followed the algorithm, and it got the right answer. It DOES fail more when the numbers are longer (because that results in more text in the context), but that can be improved by having the model focus on the right subset of the text, right?
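For concreteness, here's a rough Python sketch of the kind of digit-by-digit procedure I have in mind (the function name and structure are mine, just to make the steps explicit, not what the model actually produced):

```python
# Grade-school long multiplication, digit by digit: every step is a
# single-digit multiply plus carry handling.

def long_multiply(a: int, b: int) -> int:
    xs = [int(d) for d in str(a)][::-1]  # least significant digit first
    ys = [int(d) for d in str(b)][::-1]
    result = [0] * (len(xs) + len(ys))
    for i, x in enumerate(xs):
        carry = 0
        for j, y in enumerate(ys):
            total = result[i + j] + x * y + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(ys)] += carry
    digits = ''.join(str(d) for d in reversed(result)).lstrip('0')
    return int(digits or '0')

# Sanity check against Python's built-in big-integer multiplication.
assert long_multiply(12987318927, 12098102983) == 12987318927 * 12098102983
```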
> It DOES fail more when the numbers are longer (because that results in more text in the context),
I tried to raise this question yesterday. https://news.ycombinator.com/item?id=45683113#45687769
Declaring victory on "reasoning" based on cherry-picking a correct result about arithmetic is, of course, very narrow and absurdly optimistic. Even if it correctly works for all NxM calculations. Moving on from arithmetic to any kind of problem that fundamentally reduces to model-checking behind the scenes, we would be talking about exploring a state space with potentially many thousands of state transitions for simple stuff. If each one has even a small chance of crapping out due to hallucination, errors at the macro scale are practically guaranteed.
Everyone will say, "but you want tool-use or code-gen for this anyway". Sure! But carry-digits or similar is just one version of "correct matters" and putting some non-local kinds of demands on attention, plus it's easier to check than code. So tool-use or code-gen is just pushing the same problem somewhere else to hide it.. there's still a lot of steps involved, and each one really has to be correct if the macro-layer is going to be correct and the whole thing is going to be hands-off / actually automated. Maybe that's why local-models can still barely handle nontrivial tool-calling.
Well, if the model can reliably keep the CPU cache plus CPU registers plus CPU instructions in context and is able to do operations based on those, then we've pretty much solved computation using LLMs, right? It could use RAG to operate on RAM and SSD.
Here we can see the amount of data a high-end traditional non-SoC CPU holds:
> For a recent high-end non-SoC desktop CPU:
> Cache: ~40-100 MB total (L1 + L2 + shared L3)
> Register files: tens to a few hundreds of KB total across cores (e.g., ~200-300 KB or so)
> Combined: So you're looking at ~40-100 MB + ~0.2 MB → roughly ~40-100 MB of total on-chip caches + registers.
I'm sure we can reduce these caches to fit in the context windows of today's LLMs (~500,000 tokens).
Then, with temperature 0 we get more "discrete" operations. Now, we still have the rare problem of hallucinations, but it should be small with temperature 0.
It doesn't work like mapping CPU caches/registers into an LLM context. Transformers have no mutable registers; they attend over past tokens and can't update prior state. RAG isn't RAM. Even with a huge context, you still can't step through CPU-style instructions without external read/write memory or tooling.
And temperature 0 makes outputs deterministic, not magically correct.
> And temperature 0 makes outputs deterministic, not magically correct.
For reasons I don't claim to really understand, I don't think it even makes them deterministic. Floating point something something? I'm not sure temperature even has a static technical definition or implementation everywhere at this point. I've been ignoring temperature and using nucleus sampling anywhere that's exposed and it seems to work better.
Random but typical example.. pydantic-ai has a caveat that doesn't reference any particular model: "Note that even with temperature of 0.0, the results will not be fully deterministic". And of course this is just the very bottom layer of model-config and in a system of diverse agents using different frameworks and models, it's even worse.
It's partly because floating point math is not associative and GPU inference doesn't guarantee all the steps will be done in the same order.
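A minimal illustration in plain Python (the same effect, at much larger scale, is what a changed reduction order on a GPU does to the logits):

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 1e20, -1e20, 1.0

print((a + b) + c)  # 1.0  -- the big terms cancel first, then 1.0 survives
print(a + (b + c))  # 0.0  -- 1.0 is absorbed into -1e20 before the cancellation

# A parallel reduction doesn't guarantee any particular grouping, so outputs
# can differ across runs even at temperature 0.
```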
Well, the LLM may re-infer the whole state fully on every instruction. Temperature 0 is deterministic and that's what we are looking for. If the model is trained properly on how the CPU state + instructions should be handled, then it should be able to produce the next state.
With temp = 0, if the model is off by one bit at step k, all subsequent steps are deterministically wrong.
Your previous example shows the best case, which is that a model can sometimes follow a textual recipe for long multiplication on short inputs. That's not the same as learning a length-generalizing, bit-exact algorithm.
Basically, what you've shown is that the model can describe the algorithm. It doesn't show it can execute it at scale. Without writable state and bit-exact ops, errors grow with length, and "focus more" only slows that failure, it doesn’t eliminate it.
They're not any better at addition, are they? If they are, I wonder how good they are at adding numbers in log space.
The paper uses a number representation that is designed to make attention easy to learn: each digit is a separate token and the least significant digit is put first, so that the first digit of the output is simply the sum of the first digits of the inputs and the second digit is the sum of the second digits plus an optional carry from the first digits and so on.
If the numbers are represented with the most significant digit first as usual, you need a bunch of intermediate steps before outputting even the first digit just to determine whether it is affected by a carry or not.
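Concretely (a toy sketch of mine, not the paper's code), with the reversed encoding each output digit can be emitted from just the digits seen so far plus a one-digit carry:

```python
# Toy illustration of the least-significant-digit-first encoding: output
# digit i depends only on input digits 0..i plus a small carry.

def add_lsd_first(xs, ys):
    """xs, ys: digit lists, least significant digit first, e.g. 123 -> [3, 2, 1]."""
    out, carry = [], 0
    for i in range(max(len(xs), len(ys))):
        x = xs[i] if i < len(xs) else 0
        y = ys[i] if i < len(ys) else 0
        total = x + y + carry
        out.append(total % 10)   # emit digit i immediately, no lookahead needed
        carry = total // 10      # the only state carried forward
    if carry:
        out.append(carry)
    return out

# 457 + 968 = 1425, encoded as reversed digit lists
assert add_lsd_first([7, 5, 4], [8, 6, 9]) == [5, 2, 4, 1]
```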
The paper looks at multiplication of numbers represented with the least significant digit first as a toy task requiring several additions as intermediate steps to study why a model large enough to perform those additions in principle fails to learn to do so in practice.
They compare with a model that is first trained to produce the intermediate additions explicitly (as a "chain of thought" with a specific format) and then has this CoT progressively shortened during training until there's nothing left of it. But that second model successfully multiplies.
The difference appears to be that the presence of the intermediate results induces a better number representation in latent space, whereas the model without CoT gets stuck in a less efficient local minimum.
So the answer to the question "Why can't transformers learn multiplication?" is that the training process is insufficient for the model to discover the best intermediate steps on its own.
You could do a similar experiment where the CoT involves first taking the logarithm, adding, and then exponentiating to get the final result, but I think logarithms are probably another computation that's too difficult to learn without additional hints for intermediate steps.
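For what it's worth, even with the logarithms given for free, the log route is numerically fragile: double-precision values carry only about 16 significant digits, so exp(log a + log b) generally can't recover an exact integer product once it has more digits than that. A quick sketch of the issue (my own example, not from the paper):

```python
import math

a, b = 123456789, 987654321
exact = a * b  # Python integers are exact

approx = round(math.exp(math.log(a) + math.log(b)))
print(exact)            # 121932631112635269 (18 digits)
print(approx)           # close, but typically off in the last few digits
print(approx == exact)  # False: a double carries only ~16 significant digits
```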
> but I think logarithms are probably another computation that's too difficult to learn without additional hints for intermediate steps.
I suppose you're probably right, but LLMs probably have a lot of log tables in their training data so I'm not so sure.
The paper is about the ability of transformers to learn a task based on training data for that task only, not about LLMs pretrained on much of the internet. And training on log tables doesn't necessarily allow the model to always output the correct logarithm, just as training on multiplication tables doesn't necessarily confer the ability to multiply.
This is a gut impression and I don't deny it, but LLMs are Large Language Models, and in my own brain, my Language Model isn't doing large-scale multiplication. I have a language-based intuition for the single-digit multiplication table and a touch beyond (and based on my observations that's already above average for a human Language Model, at least in my age peer group), but it's not my Language Model doing 283 times 9284. That requires a symbolic manipulation model, and in fact I would observe that my personal neural net, for all the things it is amazingly good at, is in fact quite terrible at that sort of multiplication too. A Commodore PET is by all measures vastly, vastly simpler than my brain, but it blows away my multiplication capabilities. And then the symbolic systems tacked on another, what, 15 orders of magnitude from that "blows away my multiplication capabilities"? Depends on how you count, but something like that.
You can sit here and force me to recite ("train me on") multi-digit multiplication problems and their result until the day I die, and my language model is only going to get marginally better. It is in practicing my symbolic manipulation that I'm going to get better and faster.
It seems to me that expecting a Language Model to be very good at multiplication is asking for a substantially superhuman level of performance from them, and one that we have little reason to believe will scale anyhow. What we need is symbolic manipulation, better than the approximation they achieve when "reasoning".
I find it rather ironic to sit here and use the aforementioned 15 orders of magnitude improvement over the Commodore PET to use that level of symbolic manipulation firepower to laboriously recreate a software system that is as bad as we are at multiplication for what may well be the same fundamental reasons... and then have the audacity to complain about it. My metaphorical dude, you did a couple trillion multiplications just to get to this single bad multiplication output... maybe another approach is called for.
A lot of savants that are able to do really cool calculations, or even people that have synesthesia seeing numbers as colors, don't actually do "real" calculations.
I think most humans that do math aren't actually literally computing things as some kind of logic machine.
We can produce logic, and follow the steps of using that logic, but it doesn't seem to me that our cognition is some kind of logic machine itself.
True. Generally it seems like you're visualizing things, moving stuff around, seeing vague patterns and trying to make them clearer. IDK how a transformer architecture would fit all of that in its context, or use it productively once it's there. You can't just keep appending forever, but you also can't delete stuff either, because unlike with humans, a deletion is a hard delete; there's no fuzzy remembrance left to rely on, so even deleting bad ideas is dangerous because it'll forget that it was a bad idea and infinite-loop. Symbol manipulation doesn't come until the end, after you have a good idea of what that part will look like.
Hmm, I wonder what happens if you let them manipulate their own context symbolically, maybe something like a stack machine. Perhaps all you need is a "delete" token, or a "replace" flag. That way you don't have context full of irrelevant information.
I guess the challenge is, where would the training data come from? Data on the internet is in its final form so "next token" is never a delete.
Edit: I guess in essence, that's what reasoning LLMs already do. IIUC the thought blocks are ephemeral, and only the response is maintained for the chat. Maybe there'd be some benefit of doing this recursively? But that's also kind of what subagents are for. So, perhaps nothing new here.
I agree with you, it seems like we are trying to make the shoe fit. Not only are we missing an understanding of what is happening inside transformers, but now we are trying to teach them, see how they respond, and then interpret it. That seems fine with viruses and animals, but we are talking about a piece of software here. Shouldn't we know what's happening inside? Maybe these kinds of papers can shine more light and give us a better understanding, but it still feels backwards to me...

Regarding the multiplication itself, shouldn't a pure understanding of the meaning of multiplication (it's basically summation) be enough for 'AI' to call it a day? If an AI or a human understands that, then the rest is the computation part. We've already got that covered, so instead of having 'AI' learn it on its own from a crazy amount of data and get it right 99% of the time, shouldn't we just give it a calculator? Somebody PLEEAASE give this AI a calculator :-)
I think you might be missing some appropriate context. I agree that it is ridiculous to expect a language model to be good at symbolic manipulation; that is best done with tool use. However, there is a significant line of work dedicated to algorithm discovery for mathematical problems using neural networks. Transformers are used here due to their popularity, but also because some theoretical analysis suggests that they are among the most efficient architectures for learning automata. It's still unclear whether this is truly sound, though, which is where this kind of research matters.
Language _is_ the symbolic manipulation system par excellence though.
There's equivocation in that statement, though, whether you meant there to be or not. There is clearly a difference in how we manipulate English words for normal human activities and the symbolic manipulation with very strict rules we today associate with mathematics and computer science. Human language goes back thousands of years, into an indefinite past we can't trace. Symbolic manipulation is a much, much more recent development, starting only ~2300 years ago around Euclid and not really coming into full development until much later... you can argue about exactly when, but I'd personally put it as late as the 19th century for it to be recognized in the modern sense. It must be something different if separated by that many centuries.
To disprove my point, please generate a list of 5 random 5-digit numbers and demonstrate multiplying them in your head as quickly as you can read them. Since you can't, clearly there is something about that that is hard for you, despite the fact that the act of reading this text, maintaining physical homeostasis while you do it, and all the other things your brain is doing as you do this represents a staggering amount of raw computation that is vastly, vastly in excess of what is nominally needed to achieve that computation.
Doing multiplication in your head isn't the point though, you can externalise language and use it to do things you can't do in your head by writing it down.
Mathematics was born out of very careful reasoning that we do through language, we only use formalisms as they allow us to avoid the massive ambiguities that exist in natural language. Formal symbolic manipulation came out of our already existing abilities of symbolic manipulation through language.
Because they produce output probabilistically, when multiplication is deterministic. Why is this so hard for everyone?
If being probabilistic prevented learning deterministic functions, transformers couldn’t learn addition either. But they can, so that can't be the reason.
People are probabilistic, and I've been informed that people are able to perform multiplication.
Are you sure? I bet you that if you pull 10 people off the street and ask them to multiply 5-digit by 5-digit numbers by hand, you won't have a 100% success rate.
Yes, and unlike the LLM they can iterate on a problem.
When I multiply, I take it in chunks.
Put the LLM into a loop, instruct it to keep track of where it is and have it solve a digit at a time.
I bet it does just fine. See my other comment as to why I think that is.
Not true though. Internally they can “shell out” to sub-tasks that know how to do specific things. The specific things don’t have to be models.
(I’m specifically talking about commercial hosted ones that have the capability i describe - obviously your run of the mill one downloaded off of the internet cannot do this).
Yes, what you're describing is not a transformer but a high-level LLM-based product with tool-calling wired up to it.
That doesn't appear to be the kind of thing this article is describing.
Numbers aren't language, or even sequences of tokens, or vectors.
There is an inherent numeric-ness and logic to math that I don't think we can represent well using LLMs and transformers.
3 isn't about the word "three" - it is a quantity or a measurement. And 3x4 is a specific numerical operation that is not really contained in that sequence of symbols.
Math is just symbol manipulation with a set of rules, no?
No. Math and especially numbers are not just symbol manipulation. Geometry is a counter-example. So is multiplication, for that matter.
Maybe you could say that algebra is just symbol manipulation.
And in any case - "set of rules" is exactly what transformers aren't good at. Transformers are good at capturing the essence of what you meant and responding in a sensible, but not rule-bound way. This works well for language problems.
Perhaps you could argue that transformers are just a set of rules (weights/parameters) being applied, and you might similarly argue that numbers reduce to logical symbols like S(0), S(S(0)), but then I'd argue that you're missing the point.
Even worse: why can't programming languages learn arithmetic?
Most languages and their stdlibs can't deal with numbers properly at all. Most overflow without errors. Most integer types can't keep arbitrary precision, and most can't promote types properly.
I only know of Common Lisp, Scheme, Python 3, Ruby, Erlang, Haskell and Raku, which handle numbers properly by default. Python is extremely slow at it, though.
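To be fair to the defaults, Python 3 ints are arbitrary precision, so exactness is only lost once values drift into floats (or into fixed-width types from C extensions). A quick illustration:

```python
# Python 3 ints are arbitrary precision: no overflow, no silent rounding.
a = 12345678901234567890
b = 98765432109876543210
print(a * b)               # exact product, every digit preserved

# Floats (C doubles) silently lose integer precision past 2**53.
n = 2**53
print(float(n) == float(n + 1))  # True: n + 1 rounds back down to n
```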
What probably works: ask it to write a Python program, but tell it not to use any built-in multiplication functions.
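A minimal sketch of the kind of program you'd hope for — shift-and-add multiplication with no `*` anywhere (the function name is mine, just to show the idea is simple enough):

```python
# Multiply two non-negative integers using only shifts and additions,
# i.e. binary long multiplication with no use of the * operator.

def mul_no_star(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:          # lowest bit of b set: add the shifted a
            result += a
        a <<= 1            # double a
        b >>= 1            # halve b
    return result

assert mul_no_star(12987318927, 12098102983) == 12987318927 * 12098102983
```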
Then your transformer would need to know Python.
I think it should be able to learn multiplication with chain of thought. Without it, it's probably really difficult to generalize the multiplication of two n-digit integers when you have to accumulate up to n products of digits and handle carrying for each output digit.
Yesterday, I learned the opposite. Simon Willison demonstrated in another thread how this works out … see https://news.ycombinator.com/item?id=45686295
That's very cool, but it's not an apples to apples comparison. The reasoning model learned how to do long multiplication. (Either from the internet, or from generated examples of long multiplication that were used to sharpen its reasoning skills. In principle, it might have invented it on its own during RL, but no, I don't think so.)
In this paper, the task is to learn how to multiply, strictly from AxB=C examples, with 4-digit numbers. Their vanilla transformer can't learn it, but the one with (their variant of) chain-of-thought can. These are transformers that have never encountered written text, and are too small to understand any of it anyway.
IMO, the mystery has a simple explanation: addition is mostly local in nature, where a given digit of the input affects only the corresponding digit of the output and, via a carry, its neighbor, while multiplication is not. That being said, LLMs don't understand addition either: the illusion will break down on very large inputs.
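One way to make the locality point concrete (a throwaway experiment of mine, not from the paper): perturb a single low-order digit of one operand and count how many output digits change under addition versus multiplication.

```python
# Perturb one digit of an operand and see how far the change propagates:
# addition stays local, multiplication doesn't.

def changed_digits(x, y, width=12):
    return sum(a != b for a, b in zip(str(x).zfill(width), str(y).zfill(width)))

a, b = 52718, 94103
a2 = a + 10  # bump the tens digit only

print(changed_digits(a + b, a2 + b))  # addition: 1 output digit differs
print(changed_digits(a * b, a2 * b))  # multiplication: many output digits differ
```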
Maybe the AGI will come with the equivalent of a "Turing Machine" enabling some kind of computability.