Computer Architecture and AIs
I think I understand the issues with LLMs now, and their massive need for hardware, power, cooling and everything. It can be summarised like this:
Current hardware is entirely inappropriate for running an LLM. You are effectively running a software emulation of a system instead of running a hardware system.
This is easy to understand, emulation-wise. Any emulation is very inefficient compared to running on the thing itself, for lots of reasons. If you understand this stuff then you know: software, memory shunting and simulation of physical hardware are orders of magnitude slower than electrons interacting at a custom-designed hardware layer. Not to mention the overheads of whatever architecture you're running your inefficient emulation on, and its requirements. I can remember how long it took for much more powerful PCs to catch up to an old school Amiga from the 16-bit era via emulation. Ridiculous. But. Exacerbated because the architectures were very different. One was parallel. One was serial. And the serial had to be awesomely more powerful to produce the same kind of output the old parallel hardware could manage. Not to mention the Amiga had some sublime specialised silicon that the generalist PC had to chuff hard at to emulate. The point here is: emulation is very expensive. And the worse the fit between what you're emulating and what you're emulating ON, the more costly it becomes.
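To make that concrete, here's a toy sketch - nothing like a real emulator, and the timings are only illustrative - of the same arithmetic done directly versus pushed through an interpreter-style fetch/decode/dispatch loop. The per-operation overhead is the whole point.

```python
# Toy illustration (not a real emulator): the same additions done "natively"
# in one tight pass, versus dispatched element-by-element through an
# interpreter-style loop. The absolute times don't matter; the ratio does.
import time

N = 1_000_000
a = list(range(N))
b = list(range(N))

# "Native": one direct pass over the data.
t0 = time.perf_counter()
c_native = [x + y for x, y in zip(a, b)]
t_native = time.perf_counter() - t0

# "Emulated": every element goes through a fetch/decode/execute step.
def dispatch(op, x, y):
    if op == "ADD":
        return x + y
    raise ValueError(f"unknown op: {op}")

t0 = time.perf_counter()
c_emulated = [dispatch("ADD", a[i], b[i]) for i in range(N)]
t_emulated = time.perf_counter() - t0

print(f"native:   {t_native:.3f}s")
print(f"emulated: {t_emulated:.3f}s  (~{t_emulated / t_native:.1f}x slower)")
```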
Current LLMs are a real kludge for the hardware we have. Current computer hardware has evolved around a completely different paradigm of what a computing process should look like (Von Neumann architecture). With memory over there. A CPU here. A GPU over there. Shunted around by an explicit bus. Some of these architectural pillars have evolved over time - a steady enfattening of the bus to try to keep up with the increased need for shuttling. Or moving the GPU compute and storage into its own local neighbourhood to avoid the shuttling, period. Because a) the massively parallel, fast, dedicated GPU is orders of magnitude better than a serial generalised CPU at mathy stuff, and b) having your data accessible on hand is orders of magnitude faster than having it all the way over there and demanding it take the bus in rush hour to get here. So. GPUs become the dominant form of compute for massively parallel tasks - traditionally graphics, but also other mathy stuff. Like physics sims. And crypto busting. But effectively. These are just spiralling out into sub-Von Neumann architectures. A machine. In the machine. Your GPU ends up being its own little fiefdom. That talks with its King.
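A back-of-envelope way to see why the shuttling is the expensive bit. The figures below are rough, widely quoted ballpark numbers for an older process node (they vary a lot with technology and memory type), so treat the ratios as the point, not the absolute values.

```python
# Back-of-envelope energy comparison: doing arithmetic vs moving the data.
# Ballpark per-operation energies (order of magnitude only, older process
# node, widely quoted figures); real values vary with technology.
PJ = 1e-12  # picojoules

energy_fp32_add  = 0.9 * PJ    # one 32-bit floating point add
energy_sram_read = 5.0 * PJ    # read 32 bits from nearby on-chip SRAM
energy_dram_read = 640.0 * PJ  # read 32 bits from off-chip DRAM

print(f"DRAM read vs FP add:    ~{energy_dram_read / energy_fp32_add:.0f}x")
print(f"DRAM read vs SRAM read: ~{energy_dram_read / energy_sram_read:.0f}x")
# Fetching an operand over the bus costs hundreds of times more energy than
# the arithmetic you do with it once it arrives.
```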
LLMs take Von Neumann architecture to breaking point. They don't fit that paradigm. What they need is massively scalable memory local to compute. And. Ideally. What they want is no movement of data at all - because it's stupid expensive and slow to do at scale at high churn rates. And GPU architecture won't save you here either. Because it's the same clapped-out architecture. But parallel instead of serial. And with a shorter, fatter memory pipe (so in modern computer terms, these are the best fit you have for desktop LLMs, but they're still kind of crap and still don't scale). Memory frameworks don't save you either, like the chunky MI300 series server hardware that massively scales up memory for hungry LLMs - but it too caps out, and doesn't properly scale. Same issue. What LLMs want is a LOT of tiny amounts of compute wired directly into memory. Because they ARE massive scale and have massive churn too. So. Compute. And memory. Together.
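You can put a rough number on that breaking point. Assume single-stream decoding where every generated token has to touch roughly every weight once; then the token rate is capped by memory bandwidth divided by model size, no matter how much compute you bolt on. The bandwidth figures below are illustrative, not benchmarks.

```python
# Rough ceiling on single-stream decoding speed: if generating one token
# touches roughly every weight once, tokens/sec <= bandwidth / model bytes.
# The hardware numbers below are illustrative, not benchmarks.

def max_tokens_per_sec(params_billions, bytes_per_param, bandwidth_gb_s):
    model_bytes = params_billions * 1e9 * bytes_per_param
    return (bandwidth_gb_s * 1e9) / model_bytes

# A 70B-parameter model at fp16 (2 bytes per parameter) is ~140 GB of weights.
print(f"desktop GPU, ~1000 GB/s: {max_tokens_per_sec(70, 2, 1000):.1f} tok/s")
print(f"HBM server,  ~5300 GB/s: {max_tokens_per_sec(70, 2, 5300):.1f} tok/s")
# More compute doesn't help; the model simply can't be read out of memory
# any faster. That's the shuttling cost in one line of maths.
```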
When you step back and look at this, what you're describing is a neuron. Small compute. Local memory. Billions of the buggers. In parallel.
This is how the average idiot human brain crunches relatively massive workloads for hours for the power output of a bacon sandwich.
What LLMs currently do is effectively run a neural net - all those neurons, in an emulation which is horrible at scale. And even small LLMs do that at a bonkers scale. Because you're shuttling all that data back and forth, passing it to your compute, passing it back and so on. God awful.
God awful means, super hungry. Super hot. Super big. Mucho expensive.
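Here's that shuttle as a minimal sketch, with made-up toy shapes: every forward pass drags every weight matrix across the bus to the compute unit and ships the activations back, for every single token generated.

```python
# Minimal sketch of the shuttle: a toy stack of layers where each forward
# pass streams every weight matrix from "over there" (DRAM) to the compute
# unit and sends the activations back. Shapes are made up and small.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 2048, 16
layers = [rng.standard_normal((d_model, d_model), dtype=np.float32)
          for _ in range(n_layers)]
n_params = sum(W.size for W in layers)

def forward(x):
    bytes_moved = 0
    for W in layers:
        # In a Von Neumann machine the weights live in distant memory and
        # have to cross the bus for every single token generated.
        bytes_moved += W.nbytes + x.nbytes  # weights in, activations out (roughly)
        x = np.tanh(W @ x)
    return x, bytes_moved

x = rng.standard_normal(d_model, dtype=np.float32)
_, moved = forward(x)
print(f"~{moved / 1e9:.2f} GB shuffled for ONE token "
      f"through a toy {n_params / 1e6:.0f}M-parameter stack")
```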
In a way. This is like old school computers. A computer the size of a warehouse. Gobbling down all the power, all the components, to come up with the answer to a simple maths sum.
Today's best architecture for LLMs resides in fancy memory framework setups, which put a lot of memory onto a hardware fabric that doesn't have terrible bottlenecks like PCIe, and localises things a lot better.
But. They still follow the old school computer designs. And they don't scale. Can't scale. And everything is horribly bottlenecked by hardware that just wasn't designed for this shit.
You need that silicon equivalent, or thereabouts, of a neuron.
You need hardware that reflects what you're doing. And not an emulation stuffed into inappropriate and slow hardware.
This also reflects early computer design. What are you trying to do? Design hardware to fit that.
LLMs expose a genuinely novel problem area. One that, to date, computing hasn't had to deal with. It requires new hardware paradigms.
The old ways of doing things can get increasingly patched and hacked, with janky workarounds layered on. But. It doesn't properly scale. You need that shift.
The early thoughts about this are along the lines of CIM - Compute In Memory. Where compute is directly housed with memory. This contrasts with current Von Neumann designs - where everything shuttles back and forth along a bus.
However. When you think about what "compute in memory" actually means. This is a goddamn neuron. Compute. Embedded with your memory. No shuttling.
Uh huh.
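As a software model of the idea - the real thing is circuitry in or next to the memory array, not Python - a CIM tile looks roughly like this: weights get written once and stay put, the multiply-accumulate happens where they live, and only the small input and output vectors ever cross a boundary.

```python
# Conceptual sketch of a compute-in-memory (CIM) tile. In real CIM hardware
# the weights sit resident in the memory array (SRAM, resistive cells, etc.)
# and the multiply-accumulate happens in or right next to that array, so the
# weights never travel. This is only a software model of the idea.
import numpy as np

class CIMTile:
    def __init__(self, weights):
        # Weights are written once and then stay put ("weight stationary").
        self._w = np.asarray(weights, dtype=np.float32)

    def mac(self, x):
        # Activations are broadcast across the array, each junction does a
        # tiny multiply, and the results accumulate locally. Only the small
        # input and output vectors ever cross the tile boundary.
        return self._w @ np.asarray(x, dtype=np.float32)

# A layer becomes a grid of tiles instead of a weight blob in distant DRAM.
rng = np.random.default_rng(0)
tiles = [CIMTile(rng.standard_normal((256, 256))) for _ in range(4)]

x = rng.standard_normal(256).astype(np.float32)
for tile in tiles:
    x = np.tanh(tile.mac(x))
print("output norm:", float(np.linalg.norm(x)))
```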
Once you see it. It's obvious. You're working with the same constraints as biology. How do I do massive, churny compute at scale, in order to compress all that stimuli data into a multi-dimensional store, without requiring the calorific equivalent of eating an entire planet every day? What we're doing at the moment is brute forcing it. But you can see the same constraints end up shaping the answer in the same direction - towards something that roughly looks like a neuron. You are. Engineering a brain. From first principles. Kind of awesome.
I think also. You can see a little into the future with this. If we imagine a future that's going to be dominated by LLMs - and I think it will be. Then computer architecture radically changes. They will move away from Von Neumann machines, towards something more like silicon brains. Or perhaps you'll get dual-use machines. With the Von Neumann bit in increasingly smaller and smaller self-contained packages. But the bulk of it will be this lattice of CIM - silicon neurons. I think, in that future, today's computers will look as awkward and archaic as 8-bit computers from 45 years ago look to us today. There won't be a CPU. And a memory stick. There will be a neural stack. You'll still likely have a static storage device of some kind. Because it scales massively. And is cheap as chips. Everything else? Stuffed into your Von Neumann package.
Today's CPUs are massively more powerful than what came before. If you look at modern CPU design, compared to those things in the past, the CPU package itself comprises more compute than everything found in several old computers put together, and then some. And here's another crucial bit: the memory embedded in the package is also larger than anything you used to find in a discrete setup. I can remember the days of having a 33 MHz CPU with 4 MB of RAM. These days I have 16 CPU cores, running at 5 GHz, with 128 MB of cache, all sitting inside one CPU package. Effectively. Those old systems have condensed into a single self-contained package, including the memory. Actual discrete memory on new systems is now a few orders of magnitude larger again, so my current gaming rig with 64 GB not only trumps old system RAM, but also old system storage - my first storage was something of the order of 120 MB. You can kinda see the progression here. What used to be discrete RAM is now bigger and directly in the CPU. What used to be discrete storage is now bigger and available as RAM. And discrete storage is now unimaginably enormous compared to old systems. Each one has progressed one step up the chain.
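Putting those numbers side by side makes the "one step up the chain" point explicit. These are just the figures from my own machines above, nothing more.

```python
# The figures from above, with the ratios made explicit.
old = {"cpu_mhz": 33,   "cores": 1,  "ram_mb": 4,     "storage_mb": 120}
new = {"cpu_mhz": 5000, "cores": 16, "cache_mb": 128, "ram_mb": 64 * 1024}

print(f"clock speed:                          {new['cpu_mhz'] / old['cpu_mhz']:.0f}x")
print(f"on-package cache vs old discrete RAM: {new['cache_mb'] / old['ram_mb']:.0f}x")
print(f"discrete RAM vs old discrete storage: {new['ram_mb'] / old['storage_mb']:.0f}x")
# Each tier has climbed roughly one rung: yesterday's RAM now fits inside
# the CPU package, and yesterday's storage now fits comfortably in RAM.
```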
You can see from that progression that you end up putting entire systems into a single condensed package. I wouldn't wonder if the current PC systems we have now don't also get compressed down into that single package. Freeing up the bulk of the rest of your local computer for the new CIM architecture instead. Take that thing a step further again. You might expect - a package that has 50 or more CPU cores. A local cache RAM of gigabytes. And, I'd take a guess, ends up the size of your hand. And discrete storage of petabytes or even exabytes. The size of half your finger.
But who knows. Old ways can sometimes stick around a lot. And it depends on how much of the split of work gets routed through LLMs versus old school explicit Von Neumanns. My hunch is. Almost everything ends up going the LLM route. And your Von Neumann. Ends up being like a co-pro. That's really good at maths.
But timescales. We're talking 20 years out, I think, to be dominant. Past my sell-by date for sure. But maybe in 5 years you'll see early examples, if they're very quickly developed. More like 10 years, I would hazard a guess. It depends how much pressure there is for it. I think if someone designs something that works really well. The change will happen fast.