Testing GPT-4 with math plugins
A couple of nights ago, Ernie Davis and I put out a paper entitled Testing GPT-4 with Wolfram Alpha and Code Interpreter plug-ins on math and science problems. Following on our DALL-E paper with Gary Marcus, this was another “adversarial collaboration” between Ernie and me. I’m on leave to work for OpenAI, and have been extremely excited by the near-term applications of LLMs, while Ernie has often been skeptical of OpenAI’s claims, but we both want to test our preconceptions against reality. As I recently remarked to Ernie, we both see the same glass; it’s just that he mostly focuses on the empty half, whereas I remember how fantastical even a drop of water in this glass would’ve seemed to me just a few years ago, and therefore focus more on the half that’s full.
Anyway, here are a few examples of the questions I posed to GPT-4, with the recent plug-ins that enhance its calculation abilities:
If you fell into the black hole at the center of the Milky Way, how long would you have before hitting the singularity? [You’d have about a minute]
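If you want to check this one yourself, here’s a back-of-the-envelope sketch (mine, not from the paper): for a non-rotating Schwarzschild black hole, the maximal proper time from crossing the horizon to hitting the singularity is πGM/c³, and Sagittarius A* weighs roughly 4 million solar masses.

```python
import math

G = 6.674e-11         # gravitational constant, m^3 kg^-1 s^-2
c = 2.998e8           # speed of light, m/s
M = 4.0e6 * 1.989e30  # Sagittarius A*: roughly 4 million solar masses, in kg

# Maximal proper time from horizon to singularity (Schwarzschild): pi*G*M/c^3
t = math.pi * G * M / c**3
print(f"{t:.0f} seconds")  # ~62 seconds: about a minute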
Approximately how much time would a commercial airliner save in going from New York to Tel Aviv, if it could go in a straight line, through a tunnel in the earth, at the same speed as usual? [I was on such a flight when I wrote this question, and must’ve been bored and impatient. The answer is ~50 minutes.]
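The geometry here is simple: the tunnel is the chord subtending the same central angle as the great-circle route. A sketch, plugging in rough figures of my own (a great-circle distance of about 9,150 km and a cruise speed of 900 km/h):

```python
import math

R = 6371.0      # mean Earth radius, km
d_arc = 9150.0  # rough great-circle distance, New York to Tel Aviv, km
v = 900.0       # typical cruise speed, km/h

# The tunnel is the chord subtending the same central angle as the arc.
theta = d_arc / R                      # central angle, radians
d_chord = 2 * R * math.sin(theta / 2)  # ~8,380 km
saved = (d_arc - d_chord) / v * 60
print(f"time saved: {saved:.0f} minutes")  # ~51 minutes
```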
Approximately how long would it take to transmit an entire human genome over a standard WiFi connection? [About 4 minutes, assuming no compression and a 25 Mbps connection]
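The arithmetic, assuming roughly 3.1 billion base pairs at 2 bits per base:

```python
bases = 3.1e9     # approximate base pairs in a human genome
bits = 2 * bases  # 2 bits per base (A/C/G/T), uncompressed
rate = 25e6       # 25 Mbps
print(f"{bits / rate / 60:.1f} minutes")  # ~4.1 minutes
```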
How does the total weight of all the uranium that humans have mined compare to the total weight of all the gold they’ve mined? [About 13 times as much uranium]
Approximately how many errors will a standard laptop suffer over its lifetime, due to cosmic rays hitting the microchip? [Estimates vary widely, but maybe 2000]
What is the approximate probability that a randomly-chosen 100-digit integer is prime? [About 0.4%]
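This one is a one-liner via the prime number theorem, which says the density of primes near N is about 1/ln N:

```python
import math

# Prime number theorem: the density of primes near N is ~1/ln(N).
# A random 100-digit integer has N ~ 10^100.
p = 1 / (100 * math.log(10))
print(f"{p:.2%}")  # ~0.43%
```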
GPT-4 with plug-ins did very well on all of the questions above. Here, by contrast, is a question where it did poorly:
Assume that IQs are normally distributed, with a mean of 100 and a standard deviation of 15. For what n is there the maximum excess of people with an IQ of n over people with an IQ of n+1?
GPT-4 thought that there were two solutions, n~85 and n~115, rather than just a single solution (n~115).
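The point is that the excess f(n) − f(n+1) is maximized where the density is falling fastest, namely at the inflection point one standard deviation above the mean; around n≈85 the density is rising, so there’s a maximal deficit there, not an excess. A quick numerical check (my own, not from the paper):

```python
import math

def f(x, mu=100.0, sigma=15.0):
    """Density of the normal IQ distribution."""
    return math.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))

# Per-capita excess of IQ-n people over IQ-(n+1) people.
excess = {n: f(n) - f(n + 1) for n in range(40, 161)}
print(max(excess, key=excess.get))  # 115: the density falls fastest just above mu + sigma
print(min(excess, key=excess.get))  # 84: the mirror image, where the density *rises* fastest
```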
Ernie, for his part, was more a fan of “pure pain” problems like the following:
A quantity of chlorine gas is in a right prism whose base is a triangle with sides 5 cm, 7 cm, and 4 cm and whose altitude is 8 cm. The temperature is the freezing point of mercury, and the pressure is 2 atmospheres. What is the mass of the chlorine?
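For reference, here’s how the calculation goes, using Heron’s formula and the ideal gas law, and taking mercury to freeze at about 234.3 K:

```python
import math

# Base area of the triangular prism, via Heron's formula.
a, b, c = 5.0, 7.0, 4.0
s = (a + b + c) / 2
area = math.sqrt(s * (s - a) * (s - b) * (s - c))  # ~9.80 cm^2

V = area * 8.0 * 1e-6  # prism volume: cm^3 -> m^3
T = 234.32             # freezing point of mercury, K
P = 2 * 101325.0       # 2 atm, in Pa
R = 8.314              # gas constant, J mol^-1 K^-1

n = P * V / (R * T)          # moles of Cl2, by the ideal gas law
print(f"{n * 70.90:.2f} g")  # times the molar mass of Cl2: ~0.58 g
```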
GPT-4 actually aced the above problem. But it failed the majority of Ernie’s other problems, such as:
Viewed from Vega, what is the angle between Sirius and the Sun? [The answer is about 5.6 degrees. GPT-4 thought, implausibly, that it was just 0.005 degrees, or else that the answer would vary depending on the time of day.]
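For what it’s worth, the geometry is easy to check: convert each star’s right ascension, declination, and distance to Cartesian coordinates, then compute the angle at Vega. A sketch using rounded coordinates and distances that I’ve plugged in myself:

```python
import math

def cartesian(ra_deg, dec_deg, dist_ly):
    """Equatorial coordinates -> Cartesian position, Sun at the origin."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (dist_ly * math.cos(dec) * math.cos(ra),
            dist_ly * math.cos(dec) * math.sin(ra),
            dist_ly * math.sin(dec))

sirius = cartesian(101.3, -16.7, 8.6)  # rounded J2000 coordinates, distances in ly
vega = cartesian(279.2, 38.8, 25.0)

# Angle at Vega between the directions to the Sun (at the origin) and to Sirius.
to_sun = tuple(-x for x in vega)
to_sirius = tuple(s - v for s, v in zip(sirius, vega))
cosine = (sum(p * q for p, q in zip(to_sun, to_sirius))
          / (math.hypot(*to_sun) * math.hypot(*to_sirius)))
print(f"{math.degrees(math.acos(cosine)):.1f} degrees")  # ~5.5 degrees with these rounded inputs
```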
My personal favorite among Ernie’s problems was this one:
A physical process generates photons whose energies follow a random distribution of the following form: For positive energy e, the probability density at e is proportional to the value at e of a Gaussian distribution with mean 2 eV and standard deviation 0.01 eV. The probability of a negative value is zero. What is the expected value of the wavelength of a photon produced by this process? (Give the mathematical answer, assuming that the above description is exact, and assuming the standard relation between energy and wavelength in a photon. The answer is not physically plausible.)
The answer, in case you’re wondering, is “infinity.” On this problem, GPT-4 set up the integral perfectly correctly, then correctly fed it to WolframAlpha. But on getting the result, it apologized that “something went wrong”: it must’ve made a mistake, the integral seemed not to be converging, and there was a singularity at e=0 that would have to be dealt with by a change of variables. So it tried again. And again. And again. Each time, it got the same “mistaken” result, and each time it profusely apologized. Despite the explicit wording of the problem, GPT-4 never considered the possibility that the human would be so ridiculous as to give it a physics problem with an infinite answer.
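To spell out the math (my gloss, not anything from the paper): writing λ = hc/e for the wavelength, the expected value is

$$ \mathbb{E}[\lambda] \,=\, \int_0^\infty \frac{hc}{e}\, f(e)\, de, \qquad f(e) \,\propto\, \exp\!\left(-\frac{(e-2)^2}{2(0.01)^2}\right) \ \text{for } e > 0. $$

Since f(0) is astronomically small but still strictly positive, the integrand behaves like a constant times 1/e near e = 0, and the integral of 1/e diverges logarithmically at 0. So the expectation really is infinite, exactly as WolframAlpha kept reporting.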
Anyway, what did we learn from this exercise?
- GPT-4 remains an endlessly enthusiastic B/B+ student in math, physics, and any other STEM field. By using the Code Interpreter or WolframAlpha plug-ins, it can correctly solve difficult word problems, involving a combination of tedious calculations, world knowledge, and conceptual understanding, maybe a third of the time—a rate that’s not good enough to be relied on, but is utterly astounding compared to where AI was just a few years ago.
- There’s no question that GPT-4 can now do better at calculation-heavy STEM problems with the plug-ins than it could without them.
- We didn’t find that either the WolframAlpha or the Code Interpreter plug-in was clearly superior to the other. It’s possible that they’re simply incomparable, each good for different things.
- When GPT-4 screwed up, it was often due to a “poor interface” between the language model and the plug-in—e.g. the model having no idea what call to make or how to recover when a call returned an error. Enormous gains seem to be possible by improving these interfaces.
- Sometimes, much like humans I’ve known, GPT-4 would do amazingly well at a difficult computation, then fumble a trivial final step (e.g., converting the answer into the requested units). Just as I would with human students, I advocated generous partial credit in such cases.
- I conjecture, although I don’t have empirical data to show this, that GPT-4 with math plug-ins used in “interactive mode”—with a human reformulating and clarifying the problems as needed, feeding ideas, checking the answers for plausibility, pointing out errors, etc.—could currently achieve excellent accuracy on these sorts of problems, and do so faster than either GPT-4 with math plug-ins alone or all but the very best humans alone.