Are large language models wrong for coding?
The rise of large language models (LLMs) such as GPT-4, with their ability to generate highly fluent, confident text, has been remarkable, as I’ve written. Sadly, so has the hype: Microsoft researchers breathlessly described the Microsoft-funded OpenAI GPT-4 model as exhibiting “sparks of artificial general intelligence.” Sorry, Microsoft. No, it doesn’t.
Unless, of course, Microsoft meant the tendency to hallucinate—generating incorrect text that is confidently wrong—which is all too human. GPTs are also bad at playing games like chess and go, quite iffy at math, and may write code with errors and subtle bugs. Join the club, right?
None of this means that LLMs/GPTs are all hype. Not at all. Instead, it means we need some perspective and far less exaggeration in the generative artificial intelligence (GenAI) conversation.
As detailed in an IEEE Spectrum article, some experts, such as Ilya Sutskever of OpenAI, believe that adding reinforcement learning from human feedback can eliminate LLM hallucinations. But others, such as Yann LeCun of Meta and Geoff Hinton (recently retired from Google), argue that a more fundamental flaw in large language models is at work. Both believe that large language models lack non-linguistic knowledge, which is critical for understanding the underlying reality that language describes.
In an interview, Diffblue CEO Mathew Lodge argues there’s a better way: “Small, fast, and cheap-to-run reinforcement learning models handily beat massive hundred-billion-parameter LLMs at all kinds of tasks, from playing games to writing code.”
Are we looking for AI gold in the wrong places?
Shall we play a game?
As Lodge related, generative AI definitely has its place, but we may be trying to force it into areas where reinforcement learning is much better. Take games, for example.
Levy Rozman, an International Master at chess, posted a video in which he plays against ChatGPT. The model makes a series of absurd and illegal moves, including capturing its own pieces. The best open source chess engine, Stockfish (which doesn’t use an LLM at all), had ChatGPT resigning in fewer than 10 moves after the LLM could not find a legal move to play. It’s an excellent demonstration that LLMs fall far short of the hype of general AI, and this isn’t an isolated example.
Google DeepMind’s AlphaGo is the best-known go-playing AI, and it’s driven by reinforcement learning. Reinforcement learning works by (smartly) generating different solutions to a problem, trying them out, using the results to improve the next suggestion, and then repeating that process thousands of times to find the best result.
In the case of AlphaGo, the AI tries different moves and generates a prediction of whether it’s a good move and whether it is likely to win the game from that position. It uses that feedback to “follow” promising move sequences and to generate other possible moves. The effect is to conduct a search of possible moves.
The process is called probabilistic search. You can’t try every move (there are far too many), but you can concentrate the search on the areas of the move space where the best moves are likely to be found. It’s incredibly effective for game-playing: AlphaGo has beaten the world’s top go players, and while it isn’t infallible, it still far outperforms any LLM at the game.
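To make that loop concrete, here is a toy Python sketch of the generate-evaluate-refine cycle described above. It is not how AlphaGo is implemented; the “game” (reaching a target sum), the value function, and the sampling temperature are all invented for illustration. The structure is the point: score candidate moves, spend more of the search budget on the promising ones, and keep the best sequence found.

```python
import math
import random

TARGET = 42                # toy goal standing in for "win the game"
MOVES = [-3, -1, 1, 3, 5]  # toy move set

def value(state):
    # Higher is better: how close this position is to the goal.
    return -abs(TARGET - state)

def search(start=0, iterations=2000, horizon=12):
    best_plan, best_score = [], value(start)
    for _ in range(iterations):
        state, plan = start, []
        for _ in range(horizon):
            # Score each candidate move and sample in proportion to how
            # promising it looks, rather than enumerating every possibility.
            weights = [math.exp(value(state + m) / 2.0) for m in MOVES]
            move = random.choices(MOVES, weights=weights)[0]
            state += move
            plan.append(move)
        if value(state) > best_score:
            best_plan, best_score = plan, value(state)
    return best_plan, best_score

plan, score = search()
print("best plan:", plan, "| distance from goal:", -score)
```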
Probability versus accuracy
When faced with evidence that LLMs significantly underperform other types of AI, proponents argue that LLMs “will get better.” According to Lodge, however, “If we’re to go along with this argument we need to understand why they will get better at these kinds of tasks.” This is where things get difficult, he continues, because no one can predict what GPT-4 will produce for a specific prompt. The model is not explainable by humans. It’s why, he argues, “‘prompt engineering’ is not a thing.” It’s also a struggle for AI researchers to prove that “emergent properties” of LLMs exist, much less predict them, he stresses.
Arguably, the best argument is induction. GPT-4 is better at some language tasks than GPT-3 because it is larger. Hence, even larger models will be better. Right? Well…
“The only problem is that GPT-4 continues to struggle with the same tasks that OpenAI noted were challenging for GPT-3,” Lodge argues. Math is one of those; GPT-4 is better than GPT-3 at performing addition but still struggles with multiplication and other mathematical operations.
Making language models bigger doesn’t magically solve these hard problems, and even OpenAI says that larger models are not the answer. The reason comes down to the fundamental nature of LLMs, as noted in an OpenAI forum: “Large language models are probabilistic in nature and operate by generating likely outputs based on patterns they have observed in the training data. In the case of mathematical and physical problems, there may be only one correct answer, and the likelihood of generating that answer may be very low.”
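A back-of-the-envelope calculation makes that quote concrete. The per-digit accuracy below is an invented number, not a measurement, but the compounding is the point: even if a model usually predicts each digit of an answer correctly, the chance of producing a long answer that is exactly right falls off quickly.

```python
# Hypothetical per-digit accuracy, chosen only to illustrate the compounding.
per_digit_accuracy = 0.95

for digits in (2, 4, 8, 16):
    exactly_right = per_digit_accuracy ** digits
    print(f"{digits}-digit answer exactly right: {exactly_right:.0%}")
```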
By contrast, AI driven by reinforcement learning is much better at producing accurate results because it is a goal-seeking AI process. Reinforcement learning deliberately iterates toward the desired goal and aims to produce the best answer it can find, closest to the goal. LLMs, notes Lodge, “are not designed to iterate or goal-seek. They are designed to give a ‘good enough’ one-shot or few-shot answer.”
A “one-shot” answer is the first one the model produces, obtained by predicting a sequence of words from the prompt alone. In a “few-shot” approach, the model is given additional examples or hints to help it make a better prediction. LLMs also typically incorporate some randomness (i.e., they are “stochastic”) to increase the likelihood of a better response, so they can give different answers to the same question.
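Here is a minimal sketch of that stochastic behavior. The candidate words and their probabilities are invented for illustration, and a real model scores its entire vocabulary at every step, but the temperature-weighted sampling below is why the same prompt can come back with different completions on different runs.

```python
import random

# Invented next-word distribution for the prompt "The sky is ..."
next_word_probs = {"blue": 0.60, "clear": 0.20, "falling": 0.15, "green": 0.05}

def sample_next_word(probs, temperature=0.8):
    # Temperature reshapes the distribution: below 1 it sharpens toward the
    # most likely word; above 1 it flattens toward uniform.
    scaled = {word: p ** (1.0 / temperature) for word, p in probs.items()}
    words, weights = zip(*scaled.items())
    return random.choices(words, weights=weights)[0]

# The same "prompt" five times; the answers can differ from run to run.
for _ in range(5):
    print("The sky is", sample_next_word(next_word_probs))
```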
Not that the LLM world neglects reinforcement learning. GPT-4 incorporates “reinforcement learning from human feedback” (RLHF). This means that the core model is subsequently trained by human operators to prefer some answers over others, but fundamentally that does not change the answers the model generates in the first place. For example, Lodge says, an LLM might generate the following alternatives to complete the sentence “Wayne Gretzky likes ice ….”
- Wayne Gretzky likes ice cream.
- Wayne Gretzky likes ice hockey.
- Wayne Gretzky likes ice fishing.
- Wayne Gretzky likes ice skating.
- Wayne Gretzky likes ice wine.
The human operator ranks the answers and will probably think a legendary Canadian ice hockey player is more likely to like ice hockey and ice skating, despite ice cream’s broad appeal. The human ranking and many more human-written responses are used to train the model. Note that GPT-4 doesn’t pretend to know Wayne Gretzky’s preferences accurately, just the most likely completion given the prompt.
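For a sense of how a ranking like that becomes a training signal, here is a hedged sketch of the usual RLHF recipe: expand each human ranking into preferred/rejected pairs, train a separate reward model on those pairs, and use that reward model to fine-tune the base model. OpenAI has not published GPT-4’s exact pipeline, so the ordering and the pairing scheme below are illustrative only.

```python
# A labeler's ranking of the completions above, best first (illustrative).
human_ranking = [
    "Wayne Gretzky likes ice hockey.",
    "Wayne Gretzky likes ice skating.",
    "Wayne Gretzky likes ice cream.",
    "Wayne Gretzky likes ice wine.",
    "Wayne Gretzky likes ice fishing.",
]

def preference_pairs(ranking):
    # Every (preferred, rejected) combination becomes one training example
    # for a reward model, which is later used to fine-tune the base model.
    pairs = []
    for i, chosen in enumerate(ranking):
        for rejected in ranking[i + 1:]:
            pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs

for pair in preference_pairs(human_ranking)[:3]:
    print(pair)
```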
In the end, LLMs are not designed to be highly accurate or consistent. They trade accuracy and deterministic behavior for generality. All of which means, for Lodge, that reinforcement learning beats generative AI for applying AI at scale.
Applying reinforcement learning to software
What about software development? As I’ve written, GenAI is already having its moment with developers who have discovered improved productivity using tools like GitHub Copilot or Amazon CodeWhisperer. That’s not speculative—it’s already happening. These tools predict what code might come next based on the code before and after the insertion point in the integrated development environment.
Indeed, as David Ramel of Visual Studio Magazine suggests, the latest version of Copilot already generates 61% of Java code. For those worried this will eliminate software developer jobs, keep in mind that such tools require diligent human supervision to check the completions and edit them so the code compiles and runs correctly. Autocomplete has been a staple of IDEs since their earliest days, and Copilot and other code generators are making it much more useful. But this is not large-scale autonomous coding, which is what actually writing 61% of Java code would require.
Reinforcement learning, however, can do accurate large-scale autonomous coding, Lodge says. Of course, he has a vested interest in saying so: In 2019 his company, Diffblue, released its commercial reinforcement learning-based unit test-writing tool, Cover. Cover writes full suites of unit tests without human intervention, making it possible to automate complex, error-prone tasks at scale.
Is Lodge biased? Absolutely. But he also has a lot of experience to back up his belief that reinforcement learning can outperform GenAI in software development. Today, Diffblue uses reinforcement learning to search the space of all possible test methods, write the test code automatically for each method, and select the best test among those written. The reward function for reinforcement learning is based on various criteria, including test coverage and aesthetics, such as a coding style that reads as though a human wrote it. The tool creates tests for each method in an average of one second.
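To show the shape of that loop, and only the shape, here is a toy sketch; Diffblue’s actual candidate generator, reward function, and search strategy are not public, so every function below is a stand-in. What it does illustrate is reward-guided search: propose a candidate test, score it on coverage and style, and keep refining the best candidate found so far.

```python
import random

def measure_coverage(test_source: str) -> float:
    # Stand-in: a real system compiles and runs the test, then asks a
    # coverage tool how much of the method under test was exercised.
    return min(1.0, 0.2 * test_source.count("assert"))

def style_score(test_source: str) -> float:
    # Stand-in for "reads like a human wrote it": concise tests with a
    # descriptive name score higher.
    bonus = 0.3 if "should" in test_source else 0.0
    return bonus + max(0.0, 1.0 - 0.02 * len(test_source.splitlines()))

def reward(test_source: str) -> float:
    # The weights are invented; the idea is a single score to optimize.
    return 0.8 * measure_coverage(test_source) + 0.2 * style_score(test_source)

def mutate(test_source: str, rng: random.Random) -> str:
    # Stand-in: a real generator varies inputs, setup code, and assertions.
    extra = f"    assert result_{rng.randint(0, 99)} is not None\n"
    lines = test_source.splitlines(keepends=True)
    lines.insert(rng.randint(1, len(lines)), extra)
    return "".join(lines)

def best_test_for(method_name: str, budget: int = 200, seed: int = 0) -> str:
    rng = random.Random(seed)
    best = f"def test_{method_name}_should_return_a_value():\n    pass\n"
    best_reward = reward(best)
    for _ in range(budget):
        candidate = mutate(best, rng)  # build on the best test found so far
        r = reward(candidate)
        if r > best_reward:            # keep it only if the reward improves
            best, best_reward = candidate, r
    return best

print(best_test_for("parse_invoice"))
```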
If the goal is to automate writing 10,000 unit tests for a program no single person understands, then reinforcement learning is the only real solution, Lodge contends. “LLMs can’t compete; there’s no way for humans to effectively supervise them and correct their code at that scale, and making models larger and more complicated doesn’t fix that.”
The takeaway: The most powerful thing about LLMs is that they are general language processors. They can do language tasks they have not been explicitly trained to do. This means they can be great at content generation (copywriting) and plenty of other things. “But that doesn’t make LLMs a substitute for AI models, often based on reinforcement learning,” Lodge stresses, “which are more accurate, more consistent, and work at scale.”