The (Underwhelming) Hype Behind AI and GPT-3


I came across Rich Sutton’s essay The Bitter Lesson last week. If you have any interest in the technology of the future, I highly recommend you give it a short read.

Sutton’s thesis is that while we often seek to apply our domain knowledge in clever ways to solve classical AI problems, the steady growth in available compute means these problems are eventually solved more effectively by large-scale, general-purpose computation, leaving our initial efforts wasted.

“Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available.”

Rich Sutton, The Bitter Lesson

From beating experts in chess and Go to achieving state-of-the-art results in speech recognition and computer vision, there is no better solution than a large probabilistic model that can automatically pick out the most important features in the input data. These models are overwhelmingly black boxes: they represent high-dimensional functions that humans can’t easily interpret.

Recently, GPT-3, the latest in OpenAI’s line of language models, was released to the public. At first glance, GPT-3 is absolutely incredible. It can write React code, tutor students in the persona of famous academics, and write entire blog posts (to which I must clarify that this post is written entirely by a human). Unlike previous strides in AI research, OpenAI released GPT-3 to the public through a beta API. There are a few reasons for this:

  1. GPT-3 is a few-shot learner: you only need to provide it with a few initial samples to prime it for the particular problem you want to solve. OpenAI has a neat little web interface for priming GPT-3, and it’s nice that no code or CLI is necessary to complete this step (a sketch of this priming step follows the list below).

  2. Perhaps the bigger reason, however, is that GPT-3 hosts a whopping 175 billion parameters. GPT-3 is so capable because it’s been trained on a huge corpus of text, mostly scraped from the internet. Its parameters are finely tuned to nearly every publication, tweet, and website available online. In this sense, GPT-3 is nothing more than a glorified Markov chain (a toy illustration of that analogy also follows below).
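
To make the priming in point 1 concrete, here’s a minimal sketch of what a few-shot prompt looks like, assuming the beta openai Python client. The engine name, sampling parameters, and example questions are placeholders of my own, not anything OpenAI prescribes; the same prompt text could just as easily be pasted into the web playground.

```python
import openai  # beta Python client for the GPT-3 API

openai.api_key = "YOUR_API_KEY"  # placeholder

# A handful of worked examples "prime" the model for the task;
# the final line is the new input we want it to complete.
prompt = (
    "Q: Which is heavier, an elephant or a cat?\n"
    "A: An elephant is heavier than a cat.\n\n"
    "Q: Which is heavier, a car or a bicycle?\n"
    "A: A car is heavier than a bicycle.\n\n"
    "Q: Which is heavier, a toaster or a pencil?\n"
    "A:"
)

response = openai.Completion.create(
    engine="davinci",   # placeholder engine name
    prompt=prompt,
    max_tokens=32,
    temperature=0.0,    # keep the answer as deterministic as possible
    stop=["\n"],        # stop at the end of the answer line
)
print(response.choices[0].text.strip())
```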

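And to unpack the Markov chain analogy from point 2, here is a toy word-level Markov chain: tally which word follows which in a corpus, then sample successors at random. GPT-3’s transformer is vastly more sophisticated and conditions on far more context, but the spirit of predicting the next token from statistics of the training text is the same. The corpus below is a made-up placeholder.

```python
import random

corpus = "the model predicts the next word the model saw most often".split()

# Count, for each word, which words follow it (repeats preserve frequency).
transitions = {}
for current, nxt in zip(corpus, corpus[1:]):
    transitions.setdefault(current, []).append(nxt)

def generate(start, length=8):
    """Sample a chain of words by repeatedly picking an observed successor."""
    word, output = start, [start]
    for _ in range(length):
        followers = transitions.get(word)
        if not followers:
            break
        word = random.choice(followers)  # frequency-weighted by construction
        output.append(word)
    return " ".join(output)

print(generate("the"))
```
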
VC Twitter and much of the tech interwebs flocked to GPT-3 as a marvel of artificial intelligence. Maybe we’ve finally built an algorithm that can pass the Turing Test? The answer is no. GPT-3 occasionally spits out flat-out incorrect responses like this:

Q [researcher]: Which is heavier, a toaster or a pencil?

A [GPT-3]: A pencil is heavier than a toaster.

I concede that GPT-3 is an incredible human achievement, but it is incredibly difficult to replicate. Moreover, referring to GPT-3 as intelligent is no more accurate than referring to, say, a mirror as intelligent. GPT-3 is a reflection of our behavior on the Internet, and it isn’t clear that it’s capable of rational thought. In fact, it can barely find non-lexical patterns, such as accurately identifying the parity of binary strings (a sketch of that kind of probe follows below). A lot of the results that have been circulating around the internet seem to be cherry-picked from several runs. GPT-3 is a step in the right direction, but it’s inevitable that this hype will die down in the coming weeks.
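
To illustrate the kind of non-lexical pattern I mean, here is a hypothetical sketch of a parity probe: compute the ground truth locally, build a few-shot prompt, and check whether the model’s completion matches. The prompt format and examples are my own invention.

```python
def parity(bits: str) -> int:
    """Ground truth: 1 if the string has an odd number of 1s, else 0."""
    return bits.count("1") % 2

# A few worked examples to prime the model, plus one held-out query.
examples = ["0110", "1011", "0000"]
query = "110101"

prompt_lines = [f"Input: {b}\nParity: {parity(b)}" for b in examples]
prompt_lines.append(f"Input: {query}\nParity:")
prompt = "\n\n".join(prompt_lines)

print(prompt)
print("\nExpected answer:", parity(query))
# The claim above is that GPT-3's completion of prompts like this is
# frequently wrong, because parity isn't a pattern it picks up from text.
```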

“The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.”

Edsger Dijkstra

More importantly, we need better ways of understanding algorithms like GPT-3. When they fail, they fail hard, and when the training corpus is this large, we need to know why. In recent years, questions about algorithmic bias in AI have arisen with respect to racism and other forms of discrimination cloaked in the training data. What kind of tools do we need to build to discover this bias beforehand, and what steps would be necessary to eliminate it?

My final question involves the practicality of GPT-3: man, it’s huge! It took over 350 GB of RAM and $12 million to train. How can we size this model down? In a few years, we’d love to have this kind of tool (or one of its successors) on our phones, and I see no clear path to that goal. Moore’s Law is undeniably slowing down, so we need to either find cleverer ways to design our algorithms to be space-efficient (i.e., not 175 billion parameters) or develop better silicon. At this stage of technology, it isn’t clear that Sutton’s thesis will hold true in the coming years.
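
A rough back-of-the-envelope calculation shows where a number like 350 GB comes from, assuming the weights are stored as 16-bit floats (an assumption on my part; the precision actually used isn’t something I can confirm):

```python
params = 175e9          # reported parameter count
bytes_per_param = 2     # assuming fp16 weights

weight_bytes = params * bytes_per_param
print(f"{weight_bytes / 1e9:.0f} GB just to hold the weights")  # ~350 GB
# Activations, optimizer state during training, and serving overhead all
# come on top of this, so hundreds of GB of RAM is roughly the floor.
```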

To these ends, we need to be more cautious when evangelizing AI. We need to think less like machines and more like people. History suggests that more data, compute, and money lead to better solutions, but they could also be a local minimum we’re currently trapped in. In a meta sense, there are many parameters we as scientists and researchers can optimize for, and these three may not be the best. Like AI, perhaps we also need to think n-dimensionally in order to solve this tough problem.


Enjoyed reading this post? Follow me on Twitter for more.
