LLMs are a failure. A new AI winter is coming.

Though LLMs had a lot of promise, this has not been demonstrated in practice. The technology is essentially a failure, and a new AI winter is coming.

Like many people, I got pretty excited when it was discovered that the transformer neural network architecture appeared to break through many years of stagnation in AI research. Chatbots suddenly had emergent capabilities, derived almost entirely from unstructured, unsupervised learning, far surpassing older technologies.

My first experiences were with unreleased models, pre-ChatGPT, and I was seriously impressed. Though these early, small models would often mess up, even generating streams of garbage text, when they worked, they worked. Spookily well. I completely understand why some people at the time thought they were sentient – but that is a whole other discussion for another time.

People were saying that this meant that the AI winter was over, and a new era was beginning. I should explain, for anyone who hasn't heard that term before, that way back in the day, when early AI research seemed to be yielding significant results, there was much hope, just as there is now, but ultimately the technology stagnated. The first time around, AI was largely symbolic – which basically means that attempts to model natural language understanding and reasoning were built on hard-coded rules. This worked, up to a point, but it soon became clear that it was simply impractical to build a true AI that way. Human language is too messy for mechanised parsing to work in a general way. Reasoning required far too much world knowledge for it to be practical to write the code by hand, and nobody knew how to extract that knowledge without human intervention.

The other huge problem with traditional AI was that many of the problems it tackled were NP-complete, which meant that whilst a lot of the time you got a result, often you just didn't, with the algorithm taking an impractically long time to terminate. I doubt anyone can prove this – I certainly wouldn't attempt it – but I strongly suspect that 'true AI', for useful definitions of that term, is at best NP-complete, possibly much worse. Though quantum computing could in principle give some leverage here, none of the technologies currently being built, or currently considered feasible, are likely to be useful: there are just not enough qubits to represent the kinds of data that would need to be processed. This is a way, way harder problem than breaking encryption secured by the difficulty of prime factorization.

So then came transformers. Seemingly capable of true AI – or at least of scaling to something good enough to be called true AI – with astonishing capabilities. For the uninitiated, a transformer is basically a big pile of linear algebra that takes a sequence of tokens and computes the likeliest next token. More specifically, tokens are fed in one at a time, building up an internal state that ultimately guides the generation of the next token. This sounds bizarre and probably impossible, but the huge research breakthrough was figuring out that, by starting with essentially random coefficients (weights and biases) in the linear algebra, and back-propagating errors during training, these weights and biases could eventually converge on something that worked. Exactly why this works is still somewhat mysterious, though progress has been made.
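
To make the 'turn of the handle' concrete, here is a minimal toy sketch of the outer generation loop in Python. The tiny vocabulary, the random weight matrix W and the functions next_token_logits and generate are all made up for illustration – a real transformer computes the scores with attention layers, but the loop wrapped around it looks like this.

```python
import numpy as np

# Toy sketch of autoregressive generation (NOT a real transformer).
# A stand-in "model" scores every token in the vocabulary given the
# context; we pick the highest-scoring token, append it, and repeat.

rng = np.random.default_rng(0)
VOCAB = ["<s>", "the", "cat", "sat", "on", "mat", "."]
V = len(VOCAB)

# Hypothetical stand-in for the trained network: a random matrix that
# scores each possible next token given the last token of the context.
W = rng.normal(size=(V, V))

def next_token_logits(context):
    """Return a score (logit) for every possible next token."""
    return W[context[-1]]

def generate(prompt, max_new_tokens=5):
    tokens = [VOCAB.index(t) for t in prompt]
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        tokens.append(int(np.argmax(logits)))  # greedy: take the top score
    return [VOCAB[t] for t in tokens]

print(generate(["<s>", "the", "cat"]))
```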

Transformers aren't killed by the NP-completeness and scaling problems that caused the first AI winter. Technically, a single turn of the handle – generating the next token from the previous token and some retained state – always takes the same amount of time for a given model and context size. This inner loop isn't Turing-complete; a simple program with a while loop in it is computationally more powerful. If you allow a transformer to keep generating tokens indefinitely, the combination probably is Turing-complete, though nobody actually does that, because of the cost.
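
A toy way to see the distinction (both functions below are made-up illustrations, not anything a real system runs): the per-step computation does a fixed amount of work determined by the model and context size, whereas an ordinary while loop can run for an input-dependent, unbounded number of steps. Chaining generation steps in an unbounded outer loop is what would restore that power.

```python
def one_forward_pass(context, n_layers=12):
    # Fixed amount of work per call: the loop bound depends only on the
    # (fixed) model depth and the context length, never on the values.
    state = list(context)
    for _ in range(n_layers):
        state = [x + 1 for x in state]  # stand-in for an attention/MLP layer
    return sum(state) % 7               # stand-in for "the next token"

def collatz_steps(n):
    # A plain while loop: no one knows a general bound on how long it runs.
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

print(one_forward_pass([1, 2, 3]), collatz_steps(27))
```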

Transformers also solved scaling, because their training can be unsupervised (though in practice they often need supervised fine-tuning as well, in order to create guardrails against dangerous behaviour). It is now standard practice to train new models on just about every book ever written and on everything that can be scraped from the internet.
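
To illustrate why this counts as unsupervised, here is a tiny sketch (all of it made up for illustration): the 'label' for each position in the training data is simply the next token of the raw text itself, so no human annotation is needed.

```python
# The training targets come for free from the text itself:
# at each position, the target is just the token that follows.
text = "the cat sat on the mat".split()
vocab = sorted(set(text))
ids = [vocab.index(w) for w in text]

inputs = ids[:-1]   # context seen at each position
targets = ids[1:]   # the "label" is simply the next token

print([(vocab[i], vocab[t]) for i, t in zip(inputs, targets)])
```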

That's the good news. Or rather, that was the good news. We've gone past that point, and we are all now up against the reality of widespread use of transformers.

All transformers have a fundamental limitation, which cannot be eliminated by scaling to larger models, more training data or better fine-tuning. It is fundamental to the way that they operate. On each turn of the handle, a transformer emits one new token (a token is analogous to a word, but in practice may represent word parts or even complete, commonly used small phrases – this is why chatbots don't know how to spell!). In practice, the transformer actually generates a score for every possible output token, with the highest-scoring token (or one sampled from among the top few) being chosen as the output. This token is then fed back, so that the model generates the next token in the sequence.

The problem with this approach is that the model will always generate a token, regardless of whether the context has anything to do with its training data. Putting it another way, the model generates tokens on the basis of what 'looks most plausible' as a next token. If this is a bad choice, and gets fed back, the next token will be generated to match that bad choice. And as the handle keeps turning, the model will generate text that looks plausible. Models are very good at this, because this is what they are trained to do. Indeed, it's all they can do. This is the root of the hallucination problem in transformers, and it is unsolvable, because hallucinating is all that transformers can do.
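
A toy illustration of the point that the decoding step has no way to abstain (the numbers and the softmax helper below are made up for illustration): whatever the scores are, even for a context unlike anything in the training data, they normalise to a valid probability distribution and some token gets picked.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Scores for a context the "model" has never seen anything like:
# the output is still a perfectly well-formed distribution, and
# argmax (or sampling) still picks a token. There is no "I don't know".
nonsense_logits = np.random.default_rng(1).normal(size=8)
probs = softmax(nonsense_logits)
print(probs.round(3), probs.sum(), int(np.argmax(probs)))
```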

I would conjecture that this is another manifestation of the NP-completeness wall that slammed symbolic AI and caused the first AI winter. It's always possible to turn an NP-complete algorithm into one that runs quickly, if you don't mind it failing to produce any output when it hits a timeout. The transformer equivalent of this is generating plausible, wrong, hallucinated output in cases where it can't pattern-match a good result from its training. The problem, though, is that with traditional AI algorithms you typically know when you've hit a timeout, or when none of your knowledge rules match. With transformers, generating wrong output looks exactly like generating correct output, and there is no way to tell which is which.

Practically, this manifests as transformers generating bad output some percentage of the time. Depending on the context, and on how picky you need to be about telling good output from bad, this might be anywhere from a 60% to a 95% success rate, with the remaining 5% to 40% being bad results. That just isn't good enough for most practical purposes. More concerning is the fact that larger transformer models produce extremely plausible bad output, which can only be identified as bad by genuine experts.

The rumour mill has it that about 95% of generative AI projects in the corporate world are failures. This isn't really surprising to anyone who was around for the dot com bubble, when corporate executives all seemed to assume that just being online would somehow transform their businesses, and that new ventures only really needed user numbers, with the financials sorting themselves out later. The same thing is happening again with generative AI, though the numbers are far larger. It is absolutely inevitable that the bubble will burst, and fairly soon. Expect OpenAI to crash, hard, with investors losing their shirts. Expect AI infrastructure spending to be cancelled and/or clawed back. Expect small AI startups that aren't revenue-positive to vanish overnight. Expect use cases based on unrealistic expectations of LLM capabilities to crash the hardest.

A good example is transformers used to assist in programming, or to generate code from scratch. This has convinced many non-programmers that they can program, but the results are consistently disastrous, because it still requires genuine expertise to spot the hallucinations. Plausible hallucinations in code often result in really horrible bugs, security holes, etc., and can be incredibly difficult to find and fix. My own suspicion is that this might get you close to what you think is finished, but actually getting over the line to real production code still requires real engineering, and it's a horrible liability to have to maintain a codebase that nobody on the team actually authored.

Transformers must never be used for certain applications – their failure rate is unacceptable for anything that might directly or indirectly harm (or even significantly inconvenience) a human. This means that they should never be used in medicine, for evaluation in school or college, for law enforcement, for tax assessment, or a myriad of other similar cases. It is difficult to spot errors even when you are an expert, so nonexpert users have no chance whatsoever.

The technology won't disappear – existing models, particularly in the open-source domain, will still be available and will still be used – but expect only a few 'killer app' use cases to remain, with the rest falling away. We're probably stuck with spammy AI slop, and with high-school kids using gen AI to skip their boring homework. We'll probably keep AI features in text editors, and in a few other places.

I know that this is a currently-unpopular opinion. It is based on solid science, however. For what it's worth, I founded a chatbot company back in the late 90s, based on symbolic AI technology, that went splat in the dot com crash. I've been around this block, and I've stayed up to date on the technology – I've built my own transformer from scratch, and have experimented quite a bit.

My advice: unwind as much as possible of any exposure you might have to a forthcoming AI bubble crash.

Winter is coming, and it's harsh on tulips.
