OpenAI!
I have some exciting news (for me, anyway). Starting next week, I’ll be going on leave from UT Austin for one year,to work at OpenAI. They’re the creators of the astonishing GPT-3 and DALL-E2, which have not only endlessly entertained me and my kids, but recalibrated my understanding of what, for better and worse, the world is going to look like for the rest of our lives. Working with an amazing team at OpenAI, including Jan Leike, John Schulman, and Ilya Sutskever, my job will be think about the theoretical foundations of AI safety and alignment. What, if anything, can computational complexity contribute to a principled understanding of how to get an AI to do what we want and not do what we don’t want?
Yeah, I don’t know the answer either. That’s why I’ve got a whole year to try to figure it out! One thing I know for sure, though, is that I’m interested both in the short-term, where new ideas are now quickly testable, and where the misuse of AI for spambots, surveillance, propaganda, and other nefarious purposes is already a major societal concern, and the long-term, where one might worry about what happens once AIs surpass human abilities across nearly every domain. (And all the points in between: we might be in for a long, wild ride.) When you start reading about AI safety, it’s striking how there are two separate communities—the one mostly worried about machine learning perpetuating racial and gender biases, and the one mostly worried about superhuman AI turning the planet into goo—who not only don’t work together, but are at each other’s throats, with each accusing the other of totally missing the point. I persist, however, in the possibly-naïve belief that these are merely two extremes along a single continuum of AI worries. By figuring out how to align AI with human values today—constantly confronting our theoretical ideas with reality—we can develop knowledge that will give us a better shot at aligning it with human values tomorrow.
For family reasons, I’ll be doing this work mostly from home, in Texas, though traveling from time to time to OpenAI’s office in San Francisco. I’ll also spend 30% of my time continuing to run the Quantum Information Center at UT Austin and working with my students and postdocs. At the end of the year, I plan to go back to full-time teaching, writing, and thinking about quantum stuff, which remains my main intellectual love in life, even as AI—the field where I started, as a PhD student, before I switched to quantum computing—has been taking over the world in ways that none of us can ignore.
Maybe fittingly, this new direction in my career had its origins here on Shtetl-Optimized. Several commenters, including Max Ra and Matt Putz, asked me point-blank what it would take to induce me to work on AI alignment. Treating it as an amusing hypothetical, I replied that it wasn’t mostly about money for me, and that:
The central thing would be finding an actual potentially-answerable technical question around AI alignment, even just a small one, that piqued my interest and that I felt like I had an unusual angle on. In general, I have an absolutely terrible track record at working on topics because I abstractly feel like I “should” work on them. My entire scientific career has basically just been letting myself get nerd-sniped by one puzzle after the next.
Anyway, Jan Leike at OpenAI saw this exchange and wrote to ask whether I was serious in my interest. Oh shoot! Was I? After intensive conversations with Jan, others at OpenAI, and others in the broader AI safety world, I finally concluded that I was.
I’ve obviously got my work cut out for me, just to catch up to what’s already been done in the field. I’ve actually been in the Bay Area all week, meeting with numerous AI safety people (and, of course, complexity and quantum people), carrying a stack of technical papers on AI safety everywhere I go. I’ve been struck by how, when I talk to AI safety experts, they’re not only not dismissive about the potential relevance of complexity theory, they’re more gung-ho about it than I am! They want to talk about whether, say, IP=PSPACE, or MIP=NEXP, or the PCP theorem could provide key insights about how we could verify the behavior of a powerful AI. (Short answer: maybe, on some level! But, err, more work would need to be done.)
How did this complexitophilic state of affairs come about? That brings me to another wrinkle in the story. Traditionally, students follow in the footsteps of their professors. But in trying to bring complexity theory into AI safety, I’m actually following in the footsteps of my student: Paul Christiano, one of the greatest undergrads I worked with in my nine years at MIT, the student whose course project turned into the Aaronson-Christiano quantum money paper. After MIT, Paul did a PhD in quantum computing at Berkeley, with my own former adviser Umesh Vazirani, while also working part-time on AI safety. Paul then left quantum computing to work on AI safety full-time—indeed, he founded the safety group at OpenAI. Paul has since left to found his own AI safety organization, the Alignment Research Center (ARC), although he remains on good terms with the OpenAI folks. Paul is largely responsible for bringing complexity theory intuitions and analogies into AI safety—for example, through the “AI safety via debate” paper and the Iterated Amplification paper. I’m grateful for Paul’s guidance and encouragement—as well as that of the others now working in this intersection, like Geoffrey Irving and Elizabeth Barnes—as I start this new chapter.
So, what projects will I actually work on at OpenAI? Yeah, I’ve been spending the past week trying to figure that out. I still don’t know, but a few possibilities have emerged. First, I might work out a general theory of sample complexity and so forth for learning in dangerous environments—i.e., learning where making the wrong query might kill you. Second, I might work on explainability and interpretability for machine learning: given a deep network that produced a particular output, what do we even mean by an “explanation” for “why” it produced that output? What can we say about the computational complexity of finding that explanation? Third, I might work on the ability of weaker agents to verify the behavior of stronger ones. Of course, if P≠NP, then the gap between the difficulty of solving a problem and the difficulty of recognizing a solution can sometimes be enormous. And indeed, even in empirical machine learing, there’s typically a gap between the difficulty of generating objects (say, cat pictures) and the difficulty of discriminating between them and other objects, the latter being easier. But this gap typically isn’t exponential, as is conjectured for NP-complete problems: it’s much smaller than that. And counterintuitively, we can then turn around and use the generators to improve the discriminators. How can we understand this abstractly? Are there model scenarios in complexity theory where we can prove that something similar happens? How far can we amplify the generator/discriminator gap—for example, by using interactive protocols, or debates between competing AIs?
OpenAI, of course, has the word “open” right in its name, and a founding mission “to ensure that artificial general intelligence benefits all of humanity.” But It’s also a for-profit enterprise, with investors and paying customers and serious competitors. So, throughout the year, don’t expect me to share any proprietary information—that’s not my interest anyway, even if I hadn’t signed an NDA. But do expect me to blog my general thoughts about AI safety as they develop, and to solicit feedback from readers.
In the past, I’ve often been skeptical about the prospects for superintelligent AI becoming self-aware and destroying the world anytime soon (see, for example, my 2008 post The Singularity Is Far). While I was aware since 2005 or so of the AI-risk community; and of its leader and prophet, Eliezer Yudkowsky; and of Eliezer’s exhortations to drop everything else they’re doing and work on AI risk, as the biggest, I … kept the whole thing at arms’ length. Even supposing I agreed that this was a huge thing to worry about, I asked, what on earth do you want me to do about it today? We know so little about a future superintelligent AI and how it would behave that any actions we took today would probably be useless or counterproductive.
Over the past 15 years, though, my and Eliezer’s views underwent a dramatic and ironic reversal. If you read Eliezer’s “litany of doom” from two weeks ago, you’ll see that he’s now resigned and fatalistic: because his early warnings weren’t heeded, he argues, humanity is almost certainly doomed and an unaligned AI will soon destroy the world. He says that there are basically no promising directions in AI safety research: for any alignment strategy anyone points out, Eliezer can trivially refute it by explaining how (e.g.) the AI would be wise to our plan, and would pretend to go along with whatever we wanted while secretly plotting against us.
The weird part us, just as Eliezer became more and more pessimistic about the prospects for getting anywhere on AI alignment, I’ve become more and more optimistic. Part of my optimism is because people like Paul Christiano have laid foundations for a meaty mathematical theory: much like the web (or quantum computing theory) in 1992, it’s still in a ridiculously primitive stage, but even my imagination now suffices to see how much more could be built there. An even greater part of my optimism is because we now live in a world with GPT-3, DALL-E2, and other systems that, while they clearly aren’t AGIs, are powerful enough that worrying about AGIs has come to seem more like prudence than like science fiction. I didn’t predict that such things would be possible by 2022. Most of you probably didn’t predict it. For godsakes, Eliezer Yudkowsky didn’t predict it. But it’s happened. And to my mind, one of the defining virtues of science is that, when empirical reality gives you a clear shock, you update and adapt, rather than expending your intelligence to come up with clever reasons why it doesn’t matter or doesn’t count.
Anyway, so that’s the plan! If I can figure out a way to save the galaxy, I will, but I’ve set my goals slightly lower, at learning some new things and doing some interesting research and writing some papers about it and enjoying a break from teaching. Wish me a non-negligible success probability!