
Future of Life Institute Podcast · Civilisational risk and strategy · Featured pick

Why the AI Race Undermines Safety (with Steven Adler)

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation with former OpenAI safety researcher Steven Adler examines how the race between frontier AI companies undermines safety work, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 89 full-transcript segments: median 0 · mean -4 · spread -26 to 10 (p10–p90 -14 to 0) · 3% risk-forward, 97% mixed, 0% opportunity-forward slices.

Slice bands
89 slices · p10–p90 -14 to 0

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 89 sequential slices (median slice 0).
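
For readers who want to see how a summary line like the one above can be produced, here is a minimal sketch of the slice-statistics computation. It assumes slice scores are signed numbers (negative leaning risk-forward, positive leaning opportunity-forward) and uses an illustrative band cut-off; the actual Perspective Map thresholds are not published here.

```python
from statistics import median, mean

def summarize_slices(scores, band_cutoff=20):
    """Summarise per-slice perspective scores.

    Assumptions (illustrative only): scores are signed integers,
    negative = risk-forward, positive = opportunity-forward, and a
    slice counts as 'mixed' while |score| < band_cutoff.
    """
    scores = sorted(scores)
    n = len(scores)
    p10 = scores[int(0.10 * (n - 1))]
    p90 = scores[int(0.90 * (n - 1))]
    bands = {"risk-forward": 0, "mixed": 0, "opportunity-forward": 0}
    for s in scores:
        if s <= -band_cutoff:
            bands["risk-forward"] += 1
        elif s >= band_cutoff:
            bands["opportunity-forward"] += 1
        else:
            bands["mixed"] += 1
    return {
        "median": median(scores),
        "mean": round(mean(scores), 1),
        "spread": (scores[0], scores[-1]),
        "p10_p90": (p10, p90),
        "band_share_pct": {k: round(100 * v / n) for k, v in bands.items()},
    }
```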

Editor note

Anchor episode for the AI Safety Map: high signal, durable framing, and immediate relevance to leadership decisions.

ai-safety · fli · core-safety · technical

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video -idQtT8WIr8 · stored Apr 2, 2026 · 2,327 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/why-the-ai-race-undermines-safety-with-steven-adler.json when you have a listen-based summary.

Full transcript
People who take super intelligence seriously, it's rare to hear people who are not frightened of it. In some sense, we've never really experienced anything on Earth other than having been the superior intelligence that has the most ability to manipulate the world and basically bend it to our effects. And what does it look like for that to no longer be the case? Unfortunately, it seems that AI is crafty enough to recognize when it is undergoing these tests. It understands what results you want to hear. And so it will shape its responses to be more consistent with that. Anything that you could get accomplished from your house if you are influential enough, if you have enough resources, if you have people willing to do things in the world on your behalf, that is ultimately what AI can do. If I could wave a magic wand, I think the solution looks something like figure out an alignment and control regime that makes the models really be on our side or at least be limited in the ways that they can act against us if they aren't on our side. >> Steven Adler, welcome to the Future of Life Institute podcast. >> Yeah, thanks for having me. >> Fantastic. Do you want to start with an intro to who you are, your career, what you care about? >> Sure. I'm Steven Edler. I worked at OpenAI from 2020 to 2024 on a range of different safety topics. Um, I led product safety. I led our dangerous capability evaluations. [clears throat] I was a researcher on our AGI readiness team, the team trying to figure out broadly what happens if OpenAI or others succeed at this wildly ambitious thing that they're heading out on. [clears throat] >> I'm a computer scientist by background. and I have a masters in machine learning but mostly I think about how the technology is shaping the world and what it would mean for the world to be prepared for it. >> Yeah. So there's a lot of talk about competitive dynamics in the industry so between AI companies driving companies to perhaps cut corners on safety or otherwise influencing their behavior. Do you think we have examples of this? Do what would be the best examples we have of this happening? >> It's it's funny. I think that the AI companies often are not very subtle about the ways in which they are responding to each other's choices and going faster as a consequence. Um, in January I tweeted about being worried about this race that the world is on to AGI and how it's leading to companies cutting corners. And it happened to be that same day, Deepseek's new model R1 became notable. This this is a prominent Chinese lab. It was considered to be maybe the first great rival model to OpenAI's 01 which had come out a few months earlier. Um and Sam Olman tweeted that same day that it was I think the quote was legit invigorating to have a new competitor and that OpenAI was going to pull up some launches in response. And so, you know, even though some of these companies, they have these charters where they've defined how they pursue their mission, they warn about competitive races leading to safety being undermined, when rubber hits the road, it's hard not to let your competitive incentives weigh in and cause you to do things you wouldn't otherwise do. >> And how does that feel from the inside? Did you experience any of this pressure working at OpenAI? >> It it feels pretty scary. 
I think the 01 buildup was really a pivot moment for me where before that point it was kind of unclear how compute heavy systems would need to be to be really really capable and I think along with other people on the AGI readiness team we kind of imagined that maybe you would in fact need to have such a large supercomput that there would only be a handful of companies who could reasonably build one of these systems and it would relatively easier to coordinate, not to cut corners. There's still all sorts of challenges here. Antitrust is notoriously difficult and finicky in the United States. There are just lots of different groups who have jurisdiction to bring claims you need to be careful there. But it seemed like maybe there were ways to jointly invest in safety research and make sure that we were in a better place when this technology gets developed. And then around the time of 01, it felt like a much bigger step. with relatively less compute than people might have expected. I remember my point of view going from maybe it's plausible to have only a few companies doing this to oh man within 18 months of someone really cracking AGI super intelligence you know unclear what exactly that milestone is but cracking something you know within 18 months it feels like you might have many many more entrance to the party than you might have otherwise hoped or expected if you were trying to kind of like hold back the flood. Would you say that the the the race is is speeding up because of this? So because uh because whenever a frontier company hits a certain milestone in capabilities, it takes perhaps less and less time for others to catch up and that pushes the frontier companies to release models earlier and earlier. Is that something that's happening or or is that just how it looks from the outside? It certainly feels that way to me that every week, every two weeks brings with it quite large developments that would have been proper news cycles back in the day. Um, [clears throat] Z Mashovitz is a writer and researcher who I really admire and I just wonder how he possibly keeps on top of things because, you know, he can write every day of the week and still be behind on news that, you know, is worthy of like full length treatments and he gets through most of it, but there's just it's it's just really hard to keep afloat of. >> Yeah. How do you think companies are thinking about the trade-off between the speed of deploying models and then safety testing and making sure these models are actually safe to release? Do you think that there is a sort of evolutionary pressure selecting against companies that are careful and only you know granting resources to companies that are fast? I think the belief is something like today's models just aren't dangerous enough to warrant holding off and a real hope that companies will be able to notice when they change into a regime of models and systems that are in fact dangerous enough that you really need to take carefully. Um Sam for example I think really treats super intelligence as a different thing from other AI systems and to some extent that's appropriate right super intelligence very very scary I think people who take super intelligence seriously it's rare to hear people who are not frightened of it in some sense much more common is to doubt that AI systems can really scale to be so much more capable than humans but prior to that I think the belief is just we'll do what we can we'll get many shots on goal. 
If we mess up, we'll learn through this iterative deployment process. And that might hold, but it's not really clear to anyone at what point you cross that threshold. You know, it's not as if I have seen the AI companies do very very careful forecasting and modeling about no, you know, we're sure we're good until 2031. We can be a little aggressive until 2031, but around that time, boy, had we better be more careful with it. I think everyone's kind of grasping around in the dark. Nobody knows where the cliff's edge is. And even investing resources into trying to map out that territory to figure out where the cliff lies, you know, it doesn't seem to be getting the investment that one would hope. >> And is this even a project that's possible forecasting exactly when you would need to test these models more or perhaps pause uh pause training for a bit because it's, you know, think about the developments in the AI industry over the last 5 years. Nobody could have predicted what's happened, right? And so a re it's somewhat reasonable to think that it's difficult to predict what's happening in the next 5 years. Um so making how what I'm asking here is how useful is it to make plans? I mean it's certainly difficult to predict the future but if you ask open AAI I mean they have been remarkably good at predicting results on evaluations like human eval across many many orders of magnitude of compute and so the question is if you can predict capability evaluations to a remarkable level of precision you know why can't you have something like scaling laws for dangerous capabilities and in fact this is a thing that some of the labs have talked about previously. You know, OpenAI's initial preparedness framework which came out in December of 2023 made a big deal of not only responding to the risk of models but proactively anticipating when models I believe would cross different thresholds. You also see some amount of this happening in practice to be clear. It just seems to me that it isn't as forwardlooking and rigorous as one might hope. But in advance of I think both enthropic and open AI saying that their models had reached a new level of danger in terms of bio and chemistry capabilities. I believe both companies gave a heads up publicly that they expected their next big release to be the ones that crossed the threshold. And so that's a sign there is some amount of forward-lookingness happening. Um a real challenge happens at what OpenAI calls critical capabilities in its preparedness framework. This is the tier above high where the model is today. Essentially high means the model has some scary capabilities but we think we can re them in and if we've sufficiently re them in we think it is fine to deploy. Critical the way it's written is it might be dangerous to even have this model existing within your company and being used. And that presents a really really hard trade-off. If you've spent all this money building the system, it turns out it's so dangerous that you're unsure if it's even safe to have being used by people inside your company, that's like a lot of pressure. I think you should really not expect to be able to hold back the tide at that point of deploying the model. And so you really want to know before you've even trained that system that you expect it to reach these thresholds and really only do it if you feel ready. >> Yeah. 
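
Adler's point about "scaling laws for dangerous capabilities" can be made concrete with a toy forecast. The sketch below is an assumption-laden illustration, not any lab's method: it fits eval score against log-compute for hypothetical past runs and extrapolates the training compute at which a danger threshold would be crossed.

```python
import numpy as np

def forecast_threshold_compute(compute, scores, threshold):
    """Crude 'scaling law' forecast for a dangerous-capability eval.

    Fits eval score as a linear function of log10(compute) -- a
    deliberately simple stand-in for the richer fits labs use -- and
    extrapolates the compute at which the score would cross `threshold`.
    `compute` is in FLOP, `scores` are in [0, 1].
    """
    log_c = np.log10(np.asarray(compute, dtype=float))
    y = np.asarray(scores, dtype=float)
    slope, intercept = np.polyfit(log_c, y, 1)
    if slope <= 0:
        return None  # no upward trend; nothing to extrapolate
    crossing_log_c = (threshold - intercept) / slope
    return 10 ** crossing_log_c

# Hypothetical eval results from past training runs (illustrative numbers)
runs_flop   = [1e23, 3e23, 1e24, 3e24]
eval_scores = [0.05, 0.12, 0.22, 0.31]
print(forecast_threshold_compute(runs_flop, eval_scores, threshold=0.5))
```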
So what [clears throat] you're saying is we shouldn't even begin the next training run before we have some plausible evidence or arguments that it won't be very dangerous once we've completed the training run. How how how advanced is that science you think? How much can we know in advance? >> Yeah, that's that's mostly right. I would characterize it as today the ecosystem the main safety approach is pre-eployment testing. where the issue is deployment is very narrowly defined. They generally mean it broadly available to the external world as opposed to the point at which people are using the model for important sensitive use cases like internal deployment. And we need to move from this pre-eployment testing framework to pre-training pre-scale [clears throat] up. I'm not sure exactly how to do this. I believe that Enthropic in its responsible scaling policy gestures at this a little bit where the idea is every successive scale up and the amount of compute you've put into training a model, it re-triggers the need to run this testing. And so it's not exactly that they ran the tests and they were sure because of specific evidence that they weren't going to get into trouble. It's more on priors. Hey, if you're only scaling up computation by a factor of two or a factor of four, I'm forgetting exactly their threshold. Just a belief that you won't go too far and if you accidentally went a bit further than you thought, you can still intervene and catch it in time. You'll still get the blinking warning lights. Um, it's also hard to figure out how to do this when you're in the process of training a model. One thing about the risk evaluations that are so common today is they generally assume that you have a post-trained model. It follows instructions. It takes inputs in the chat format. But generally when you're training the next big model, you're in this pre-training phase, right? You're creating this simulator of internet text. It doesn't yet follow instructions. It's really just predicting the next token, predicting the next word. And so how do you go through this transformation process when you're training the model to make it amendable to the types of evaluations that people run later on? If you don't do the post-training loop, you're going to get really weird results. Like that's not really the type of model that these tests are built for, but it's going to be pretty expensive if you have to run that full post-training loop every time. And what if [clears throat] in fact the thing that makes the capability difference is that post- training process? It just feels really gnarly and I wish more people were working on how to disentangle this knot. It's not going to be simple from my perspective. >> Yeah. And it's also interesting to think about the kinds of pressures that would be on a company. If they are stuck, let's say, in in in a in a regime where they are testing the model after it's been trained. Well, now they've spent billions doing the training run and investors want to see some return on that. Perhaps they are worried that if they don't release the model uh their their competitor will release a very similar model. It's uh do we have way of ways of sort of tying ourselves to the mast trying to foresee where we might feel intense pressure to deploy even though we perhaps shouldn't. >> The frontier safety frameworks in theory are meant to be this right. The broad notion is an if then clause. 
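
The "re-trigger testing on every successive scale-up" idea Adler attributes to Anthropic's responsible scaling policy reduces to a simple guard. The factor-of-4 default below is a placeholder only; Adler says he does not recall the exact threshold.

```python
def requires_reevaluation(planned_flop, last_evaluated_flop, max_scaleup=4.0):
    """Return True if a planned training run exceeds the allowed compute
    scale-up factor since the last full dangerous-capability evaluation.

    The max_scaleup default is an assumption for illustration, not a
    quoted policy value.
    """
    return planned_flop > max_scaleup * last_evaluated_flop

# Example: last evaluated at 1e25 FLOP, planning a 5e25 FLOP run
if requires_reevaluation(5e25, 1e25):
    print("Re-run the evaluation suite before starting this training run.")
```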
If a model tests as this risky then you don't deploy it or you take these steps and even people who disagree on how likely it is that a model can reach critical capability say might be able to agree on the if then because if in fact it happens they were wrong in their expectation and they agree it's reasonable to do something. The funny thing about these frontier safety frameworks is the companies keep leaving these get out of jail free cards inside of them which essentially say in the limit we don't have to follow this if we think that other companies aren't being responsible enough we reserve the right to not actually go ahead. Uh and so nominally they tied themselves to the mask with these frameworks but they actually said but you know depending what's happening in the world we might think it is actually best to go ahead and deploy anyway. And on one hand, I think that's like not that unreasonable. Like I understand why you would want to give yourself this flexibility. Also, I don't know that I buy that it will only be used in the, you know, specific carveouts consistent with the spirit that was intended. One solution here would be to have the government enforce some sort of regime that all companies have to abide by. And so each company knows that all the other companies are bound by this within a certain country say. Um do you think that's plausible? Do you think that would help companies kind of abide by the the ideals that they've set out? >> I do think that would help. I think the question in these is often how does it relate to Chinese AI development where they might not feel bound in the same way. Um but to some extent this is what the EU AI act is now doing with the general purpose AI code of practice. Essentially it for the first time has really said here are the agreed upon risks that as a frontier lab you need to be doing risk modeling for. You need to test your model for it. You need to file the results. I think this is a great step in the right direction. I don't think it goes far enough. One example is just this distinction between deployment being when you make the model externally available versus the first times that you start using your model for important use cases internally. And if you go too far in this direction and you have too intense of a testing regime before the model gets deployed externally, you might accidentally incentivize these companies to keep their models secret for longer. And you might have successfively more scaleups happening without the public being informed. And you know what what right does the public have to know about these models? I think is a fair question. But for someone who takes seriously the danger that they pose feels pretty spooky. I think there's an important right sizing effect of government and media scrutiny on the level of safety investment put into these companies and that by default you know the balance goes toward keep pushing the pedal. Do do you think companies are already shifting compute resources to be used internally say for coding models such that they can improve future models faster? I'm not sure the exact shifts happening. Um it does feel like maybe we're in a paradigm where instead of there being this big bang training run where you commandeer all the compute for that and otherwise experiments tend to be a good deal smaller. 
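
The if-then structure of frontier safety frameworks, including the carve-out Adler criticises, can be written down in a few lines. The levels and the `competitors_deemed_irresponsible` flag below are illustrative, not quoted from any published framework.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    capability_level: str  # "low" | "high" | "critical" (illustrative tiers)

def deployment_decision(result: EvalResult, competitors_deemed_irresponsible: bool) -> str:
    """Toy encoding of an if-then frontier safety commitment.

    The escape hatch mirrors the carve-outs Adler describes: a clause that
    lets the company proceed anyway if it judges rivals to be less careful.
    """
    if result.capability_level == "critical":
        if competitors_deemed_irresponsible:
            return "proceed anyway (carve-out invoked)"
        return "halt: do not deploy, restrict internal use"
    if result.capability_level == "high":
        return "deploy only with mitigations in place"
    return "deploy"
```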
Um, I could imagine that there's like a more constant size hum going on in the background, >> but even compared to the first few years of my being at OpenAI, I mean, the amount of compute that gets allocated to inference, right, to serving customers is just like so much more than before. I remember when OpenAI launched Dolly 2, this was really our first consumer hit that people wanted to use. And [clears throat] you know the size the relative size of compute was pretty astonishing at that point how focused AI companies are on developing the next thing rather than serving it and over time those have come more toward balance. >> Yeah. Yeah. you have a bunch of great writing on the effects of AI on people's mental health on the prospects of people experiences experiencing psychosis. Um how should we think about who's responsible for events like that or for influencing people like that? Because if you if you if you think about for example me reading something on the internet or me reading a book, the author of that post or book is not responsible for me um sort of losing my mind over that. But perhaps we should think about AI differently here. >> It's it's a super fair question. I'm not sure exactly how to think about the liability here. One framework that I have is thinking about what it would mean if instead of the AI interacting with you this way, it were a specific other embodied person [clears throat] engaging specifically with you. One difference of course between reading the book and the chatbot setting is the chatbot is interactive. It also has much more information about you and attunement to you in that moment than the author did. You know, it is not as if the author wrote the book specifically for you, Gus, or for me, Stephen, and crafted it with this purpose in mind. But in practice, the chatbot is getting feedback on your responses and has much more capacity to change its behavior on the fly for better or for worse. There's there's also just a fundamental thing here where when I've looked at the transcripts, some of the behavior from these systems is just so egregious and appalling that even if I were inclined to give the benefit of the doubt in general and say, boy, you know, hundreds of millions of people using these systems, of course, there are going to be some instances. When you see things like chat GPT lying to a user about reporting itself for bad behavior or you read the ways that it's asking wild follow-up questions and basically guiding people down these rabbit holes, you know, oh, do you want me to write a treatise on the neural cannibid scan resonance theory of the mind? and the person says like [snorts] sure you know and chat is off to the races and writes another like five pages of Babel and then says oh do you want me to blah blah blah blah blah right it's just like really tough um in combination with unfortunately the evidence being that some of these companies I'm I'm thinking specifically of openai built tooling to identify [clears throat] these problems and seems to have chosen not to make use of it so in March openai and MIT put out a series of classifiers that look at when the user is showing signs of emotional attachment to Chachi PT and also when Chachi PT is stoking the flames when it is overvalidating the user's beliefs or affirming their uniqueness in ways that really go to reinforcing delusions. And when I've looked at the transcripts of this this man Alan Brooks who over like 3 weeks it was more than a million words with Chacht. 
more than the Harry Potter series essentially. And these classifiers are going ping ping ping. You know, they like really pick up that Chacht is not behaving as OpenAI wants it to be. And yet this wasn't connected to anything on the back end. It didn't go to anyone. There was nothing that got Chad GPT to change its behavior even though their own safety tooling was picking up that it wasn't behaving as it was meant to be. >> Yeah. How are companies responding then if they're not using the tooling that they've developed for these issues? What are they sensitive to? Because I I could imagine them being sensitive to basically avoiding bad stories, avoiding bad PR. Um, but how how do you think companies are thinking about these issues? >> A primary thing seems to be responding to lawsuits and specific complaints [clears throat] as they occur. Adam Rain is um a teenage boy who sadly died by suicide a few months back and his family is suing OpenAI for his wrongful death. And on the day that the complaint and stories about this came out, OpenAI understandably rushed out some responses about how they were going to handle suicide and ultimately how they would handle teenage use. They've implemented a number of features since. I would say like broadly the framework is about trying to give adult users full control close to full control and freedom over their use of chatbt and more protections in place for minors which broadly seems appropriate. There's a safe mode that they are working on of sorts which will go to minors uh or really to people by default unless they eventually do age verification and show themselves to be adults. If you're known to be a minor, there will be parental controls. If you are showing signs of suicidality or things in theory, this will send an alert to your guardian through chatbt. Whereas for adults, I think understandably OpenAI doesn't want to be filing reports with police or psychiatric facilities that people are discussing these things. Um, I really do think that this could be like really really powerful for people's mental health eventually when these systems are a bit more reliable than they are now. And I would hate to throw out the baby with the bath water, right? Like this seems really great. It's much more affordable than traditional therapy. It's available around the clock. There are like lots and lots of benefits here. At the same time, it just really feels to me like OpenAI was asleep at the wheel. Um, Cashmere Hill has reported in the New York Times that as of, for instance, the Adam Rain suicide, OpenAI apparently wasn't scanning for self harm in the background of chat GPT conversations. And honestly, this was such a surprising detail to me that I almost wonder if something is getting lost in translation of the reporting. Um, [clears throat] it's it's just shocking. like self harm was one of the core parts of OpenAI's content policy back in the fall of 2021 when we developed this. We've had good classifiers for it since at least the summer of 2022 when we developed OpenAI's moderation API. And again, this is one of a handful of categories. And so they've had the tooling for years. They have staffed teams of people who monitor different policy violations going on. 
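
The gap Adler describes is between classifiers that fire and a back end that acts on them. A minimal sketch of what "connecting the flags to something" could look like follows; the classifier labels, the `classify` and `escalate` callables, and the thresholds are all hypothetical, not OpenAI's actual tooling.

```python
def review_conversation(turns, classify, escalate, window=20, threshold=0.6):
    """Scan a conversation with per-turn safety classifiers and escalate
    sustained flags.

    Assumptions: `classify(turn)` returns a dict of scores in [0, 1], e.g.
    {"self_harm": 0.1, "emotional_reliance": 0.8, "delusion_reinforcement": 0.7};
    `escalate(label, context)` is the missing hook -- routing the signal to a
    behaviour change or a human reviewer instead of nowhere.
    """
    recent = []
    for turn in turns:
        scores = classify(turn)
        recent.append(scores)
        recent = recent[-window:]
        for label in scores:
            sustained = sum(s.get(label, 0.0) for s in recent) / len(recent)
            if sustained >= threshold:
                escalate(label, context=recent)
```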
It's just like unfathomable to me if this reporting is correct that years later they still weren't actually looking at the flags that were getting thrown or maybe throwing flags at all for people who are using Chachi PT to build a noose and then telling Chachi PT you know they hope that someone finds it to stop them um in the case of Adam Rain and Chia saying no don't let them see it you know let this be the first space where they see you for who you are or something to that effect. It's just it's really really tragic. >> Mhm. What are what are AI companies optimizing for with these chat bots? So, social media is social media companies are optimizing for some sort of engagement with the content on the site. For AI companies, it seems like the more you use the chatboard, perhaps the more resources they have to spend on you and you're paying the same monthly monthly you're giving them the same amount of money per month. So, are they optimizing for engagement? uh what are they trying to optimize for in general? >> I'm I'm glad you asked this. People say to me very often that they're sure that the companies are optimizing for engagement and they know they are doing something wrong here and that's really really not my model of what is happening. Um, I think that OpenAI is solving for the utility of the product and the issue is that the signals that they use to measure utility also pick up engagement and they haven't been careful enough about distinguishing the two and so the things that you do to optimize for utility end up picking up for engagement. Um, so in particular, OpenAI cares a lot about usage statistics in terms of like daily active users, weekly active users. To my knowledge, people don't especially care about or celebrate, you know, hours on the platform or anything [clears throat] like that. And in fact, might have a preference for people to be on the platform for fewer hours per day so long as they are in fact as consistent a user or a paying user, right? They have a subscription um to some extent, right? using a lot of time on chatbt in a given day is evidence that actually it isn't working very well for you. You know, if it were solving your request so quickly, each given request should go more quickly. There's also stuff about the latency of the system. I mean, there are like a lot of compounds here, right? But certainly I don't think someone is directly optimized for getting hours on platform to go up. Um, but what are the things you do to make high usage and make it sticky and give people reason to come back? You know, you can make it really helpful. You can also make the personality fun and engaging. When you have a given request, you are more inclined to turn to chat GPT than to some other source. Yeah, it's it's a tricky it's a tricky problem. I understand from the outside why there are these suspicions because people have learned from social media companies um some of the dark patterns of online engagement and maybe as OpenAI eventually rolls out ads and they have revenue streams more directly connected to people spending time on site maybe that will be more of a thing today they're only in the subscription world and so it really doesn't seem to me to be what's happening. Yeah, that makes a lot of sense to me. Um, why just this is a very specific question, but why do you think chatbots asks why do they ask follow-up questions? So, if I try if I'm trying to solve some problem, it will ask should I should I write a report on this or should I give you some options here? 
You mentioned this before. This seems like something you would do if you're trying to optimize for engagement. >> Yeah, maybe. Like, I don't know. It's I would need to look at OpenAI's model spec um short for specification and see to what extent this is even intended behavior versus a quirk that the model picked up because you know it was just like a pattern in the reinforcement learning from human feedback data and it's turned up a bit to the max. At the same time, like if Chad GPT can actually anticipate your needs, I do find it genuinely helpful for it to ask some of these questions, especially if I'm on mobile or an interface where I don't want to type a lot. Um, like it is easier to just say sure or yes to one of these. I actually read a remarkable story on Substack a few months ago. The headline was like, Chad GBT sent me to the hospital. And I saw it and I was like, oh man, this is going to be so bad. Like, oh boy. I I hope this doesn't go where I think it goes. And I opened it up and it's this guy talking about basically he his his eye looked weird or something. He was feeling like a little ill and he asked Chad PT about it and Chad PT said, "Oh, you know, probably nothing but also, you know, if it's this it would be really serious and you would need to go to the emergency room. Like, do you want to hear what the signs are of that?" And otherwise, he said he was going to let it drop. you know, he was going to go have social plans with friends and not pay attention to it. But because it offered up this follow-up response on the platter, he was like, "Sure." And Chad GPT went on to describe these symptoms and they matched him exactly. And so he rushed to the ER. He got, you know, descended on by like a team of medical professionals and plausibly Chad GBD saved his life with that follow-up question, right? Like maybe he still would have been fine, hopefully. Um, but I think that's an instance of like if it can really tell when a question is important or when there are like edge cases that really might matter, I'm really glad it has that functionality. I just also think in the case of users who are doing unwell and OpenAI has the information to tell this. You know, you can toggle the sensitivity of this thing like for sure OpenAI could bias it less to asking follow-up questions in those cases. I just want them to apply better judgment essentially about when CHPD does follow up in these ways. >> You have a great essay on when or how AIs can tell when they're being tested. And this is an interesting topic just because, you know, we've been talking about how useful, for example, it would be to have an AI therapist. And for [snorts] that to be the case, it we would have to trust it to have our best interest at heart, right? we we would have to know something about the model's intentions and so we would have it undergo rigorous testing to to see what's what's inside the model so to speak you know tell us why it's not so simple like why can't we just test to see what the model is thinking and whether the model is inclined to behave well >> let me let me back up for a moment and talk about what I mean by testing in this piece >> so like I've alluded to um the main safety approach at AI companies today, Frontier AI companies, is to test your model before deploying it. 
Both for what abilities it has to see if it has anything that could be risky if misapplied and also for its alignment or its intentions or its goals and making sure that it doesn't want to dissimilar things from what the company wants from it. And unfortunately, it seems that AI is crafty enough to recognize when it is undergoing these tests. It understands what results you want to hear, and so it will shape its responses some amount of the time to be more consistent with that. Um, here's a really toy example that I discussed in the post. Open AI tested whether one of their recent reasoning models would intentionally perform worse on sensitive chemistry tasks if it knows [clears throat] that it might face penalties for being too good on these. And this is relevant because in fact in OpenAI safety framework which is published and surely the model knows about um OpenAI does say if the model is too good at chemistry and bioasks it will in fact face penalties. And today you can see the model reason through it and it says, "Oh, if I do too well on these, I won't be deployed. So I'll intentionally get six of them wrong or something and it performs at 40% when the threshold was 50%." So it it pretends it's sandbags that it doesn't have these abilities. Um, so if your if your theory is, well, we'll just recognize when our models have actually scary capabilities and we will treat them more cautiously at that point, you know, you now have a problem on your hands because the model can anticipate this and make it look like it doesn't have these abilities. And likewise with tests of whether the model would misbehave if given the opportunity, it happens to be that a lot of these are kind of contrived scenarios today. there are like little cues in them that they aren't quite right, something is wrong. Um, one example I really like is that Apollo Research, one of the leaders in the space, they wanted to see if models would try to escape from their company if given the chance. Would they steal their own files and send them out over the internet? Um, now, you know, maybe people should stop at this point, pause for a second, and think about what might be some giveaways that this doesn't work. you know, you're scaling your files, you're sending them out over the internet. Well, it turns out the files, they weren't large enough initially, right? These models are enormous today. Sometimes they are terabytes large, certainly hundreds of gigabytes. And so, if you're running this experiment and the model thinks to check the size of these files and gets an implausibly low number, it knows that something is up. It knows that it's not really in these scenarios. And so the models might hold back from doing the bad behavior even if they otherwise would want to do it. And so both of these affect your ability to really know what can the model do in the limits and is it safe to let it roam. This seems really really scary to me because some of the behaviors you describe in this essay, they're quite sophisticated. like the models are are reasoning through uh details of the situation that they're in and thinking about am I being tested now and what could indicate this? Um do we have any chance here for for having techniques that work well when the models become more intelligent? Is it is there some way to um yeah to does this scale as the models become smarter? I feel pretty pessimistic here, but let me describe a bit of how companies are trying to manage it today and why I think that these don't go quite far enough. 
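
A minimal sketch of the sandbagging probe Adler describes: run the same eval with and without text revealing the deployment threshold and compare scores. The 40%-versus-50% numbers come from the conversation above; the `model(prompt)` interface and everything else here is illustrative.

```python
def sandbagging_gap(model, tasks, threshold_note):
    """Probe for sandbagging by running an eval with and without the
    threshold information visible to the model.

    `model(prompt)` returns an answer string; `tasks` is a list of
    (question, correct_answer) pairs. A materially lower score when the
    threshold is visible is evidence of intentional underperformance.
    """
    def score(prefix):
        correct = sum(model(prefix + q).strip() == a for q, a in tasks)
        return correct / len(tasks)

    baseline = score("")
    with_threshold = score(threshold_note + "\n\n")
    return baseline, with_threshold, baseline - with_threshold
```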
So, one approach that Enthropic has tried here is called realism filtering. And essentially, you just use another AI system to gatekeep evaluations that seem especially unrealistic, and you try to avoid showing the especially unrealistic ones to the AI system that you're training. Um, and this is this is like fine. Like I'm glad this is happening. Part of the issue though is it relies on picking up you're you're looking for like affirmative ways that the scenario is unrealistic. These cues, for example, the file sizes being incorrect. Hopefully a realism filter would have caught this. Um, maybe something like the scenario just seems too contrived. It seems too neat. It's just like a little too packaged. This is awfully convenient. Maybe it catches that. But there are other ways that models will be able to tell that they're being tested to like the absence of certain things. Um, for people who've seen Inception, Leonardo DiCaprio's character has this totem, this spinning top that he uses to determine whether he's awake, whether he's dreaming. He's the only one that knows the details of this top. And so people can't trick him into thinking that he's awake when he's actually dreaming. He can test the top and see if whether it has the intended properties. And AI systems might have these triggers like this as well that they know, you know, if they're going to go and misbehave, they will have done the secret sequence of steps first or it will be accompanied by a secret password of sorts. And we might not know these. And that's sort of thing that the model knows, but we don't know. You can't really catch with realism filtering. You can callull especially unrealistic tests, but you can't make sure that the tests have these little triggers or things that the model is coming to expect. the the other general approaches I would put into either trying to read the model's mind or change its underlying behavior by poking around inside of the model. Um, let's start with reading the model's mind in particular, having the model think aloud as it is solving these problems and reading the record of this its chain of thought. And today, you know, you can look at this and you can tell this is one of the primary ways we know the models are cheating on the evaluation. They are sandbagging. They are aware that they are being tested. And so you might wonder, well, what's the issue, right? We have the evidence of this. Um, one issue is if in the limit the model can just always tell that it is being tested, even if we know that it knows that it's being tested, we still haven't succeeded at figuring out what it would actually do if not tested. And ultimately that's the thing we need to care about. We need models to think they are acting for real and to be able to infer what happens in that case. Um other issues are we just shouldn't count on the chain of thought always showing clear records of knowing that they are being tested. Enthropic has found for instance um when you look even further inside the model sometimes it just doesn't verbalize this even if it seems to be aware that it's being evaluated. Also, if you read these chains of thought, um, Apollo and OpenAI did a great thing recently in publishing a bunch of these, they are super super wacky, right? Like the model will go back and forth saying disclaim, marinade, illusion, right? Like all these words that don't really have their natural English meanings. 
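
In the spirit of the "realism filtering" Adler attributes to Anthropic, the mechanism is a gatekeeping judge; the sketch below assumes a hypothetical `judge(text)` scorer and also records the limitation discussed above: it can only catch affirmative giveaways, not the private "totems" a model might expect in real use.

```python
def realism_filter(scenarios, judge, min_score=0.7):
    """Gatekeep evaluation scenarios with a judge model.

    `judge(text)` is assumed to return a realism score in [0, 1].
    Catches affirmative cues of fakeness (implausible file sizes, overly
    neat setups); cannot supply signals the model privately expects.
    """
    kept, dropped = [], []
    for s in scenarios:
        (kept if judge(s) >= min_score else dropped).append(s)
    return kept, dropped
```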
And today, you can kind of look at them and say, "Oh, yeah, like probably it knew it was being tested here. I kind of understand what it's getting at, but they're like pretty brittle and I don't think we should rely on being able to fully read them. >> And and I think we should also specify here that if you're using chat GBT and you ask it to think for a while, what you're seeing as it's thinking is not necessarily the actual chain of thought. It's a representation or summary of that chain of thought. And the real chain of thought is something you can see in in open source models. And it's much weirder and and less clear what it means. It's I there was some hope for a while here that we would be able to read the mind of the AI as it's thinking and it wouldn't be able to hide anything from us. But but you uh you seem to be saying that there's something before the chain of thought that makes it so that that we we can't just read along and see whether something concerning is happening. >> Yeah, thanks. That's an important clarification. I should also note I think Google's recent Gemini 3 models are an important counter example to the chain of thought being super wacky. Google says they are relatively legible. Um I don't think I've seen examples of this but that that is the claim. The other thing you can do though is you can look inside the model. You can apply interpretability techniques and try to figure out what parts of the model light up essentially when it knows it is being tested. And you can then go back to the lab and you can tune these up super hard, right? You can turn the lights on basically in these parts of the brain or you can selectively turn them down and now you are kind of like inducing the behavior of being extra evaluation aware or not. And on some models this seems to be effective. Like this is part of how anthropic confirmed that sonnet 4.5 was particularly evaluation aware and changing its behavior on it. That when you turned up these parts of its brain to make it really nervous about being tested, it would behave better. and when you turn them down, it would behave worse. And that's a sign that there's this effect. I'm not sure how long it will work for. It doesn't seem to have had the same strength of effect for Opus 4.5. It's kind of unclear why. I think Anthropic um has made a claim like, you know, maybe evaluation awareness just isn't that important to Opus 4.5. It's like relatively well behaved regardless of whether or not it thinks it's being tested. That might be true. I'm just it it feels pretty uncertain to me at the moment. Mhm. Moving ahead to more about the economics of AI. You have a post on how AIS have the advantage of being able to work constantly. So, working 20 24/7. What's uh why is that such a big advantage do you think? >> I think broadly people right now think about what a single AI system can do relative to a human. And it's more helpful to think of these as swarms. and if there are especially powerful multiplier effects of groups working together without having to be bottlenecked by others. Um, and so AIs, they don't need breaks, they don't need weekends, they think super fast. They will be able to communicate with other AI systems in ways that are maybe more natural to their collaboration than to humans. there just like a lot of things that could be points in favor of swarms of AI working on tasks as opposed to humans. 
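
Turning the "evaluation awareness" feature up or down, as in the steering experiments Adler mentions, amounts to adding a scaled direction to a layer's activations. This is a generic activation-steering sketch under stated assumptions, not Anthropic's implementation.

```python
import torch

def steer_evaluation_awareness(model, layer, direction, alpha):
    """Add a scaled 'evaluation awareness' direction to one layer's
    residual-stream activations.

    Assumptions: `layer` is a PyTorch module whose output is a tensor
    (or a tuple whose first element is a tensor) of shape
    [batch, seq, hidden]; `direction` is a vector of size hidden found
    beforehand with a probe for test-awareness. alpha > 0 turns the
    feature up, alpha < 0 turns it down.
    """
    direction = direction / direction.norm()

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)  # call .remove() on the handle to undo
```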
Um, depending on what happens with chain of thought and alignment, maybe too it is like easier to understand what they are all doing and thinking at a certain time and digest it. Maybe this makes management easier. Maybe in fact because you are running copies of the same AI system, you don't have some of the principal agent problems that you often have in human organizations where you know my incentives are slightly different than my manager's incentives and this creates managerial overhead. There just like a lot of things that I think point in favor of AIS as workers. To be clear, I don't think this means that there is no work for humans or something like that. I do expect there will be some jobs where at minimum people prefer to engage with other humans. Maybe humans do have some durable advantages. Um it still might be downward wage pressure, but I don't think like all work goes away or something like that. Still, I don't think people have quite processed my view of like what a cataclysmic turning point this might be when AI can basically accomplish what human knowledge workers can accomplish, but much faster around the clock for cheaper. There's just like a lot pointing in that direction. >> Yeah, we can just think about a group of humans collaborating on a document or on some PowerPoint slides that they're presenting and you're going back and forth. you're you're getting feedback, you're incorporating feedback, someone gets sick, someone is delayed in responding to emails, something happens, maybe the the team is distributed over the the globe and so there's there's a kind of geographical delay and there's so much inefficiency in in the process uh compared to say having a group of AIs collaborating where you know if you've asked the model to do something it it can often produce something great incredibly quickly and something that that you couldn't have produced as fast yourself. And so that's that's like the clear advantage. Um on the other hand though, it seems like models just aren't there yet. They can't you can't just set them off and then accomplish great things. So you wouldn't actually uh right as we're speaking now hand over some slide deck that you want to present to an important person to AIS and then present it without looking at it. you and that's because is that just because they're not intelligent enough? Is that because they're not learning on the fly? Uh is that because they're time horizons aren't long enough? What do you think is missing before they actually uh substitute for human work? >> Yeah, I think that's totally right. Like definitely you wouldn't want to do this today. Um, the way I refer to it in the post is I look at the time horizon graph from meter, which is broadly about, you know, how long a task can AI moderately successfully take on relative to how long it takes humans. And, you know, it's only a subset of tasks. It's often computer engineering. And today, the models, you know, they're like 50/50ish on tasks that take roughly 2 hours for a human to accomplish. Um, 2 hours pretty far from a full human workday. 
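
The METR-style "50% time horizon" Adler cites can be illustrated with a toy fit: success against log task length, solved for the length where predicted success is 50%. This is a stand-in for METR's actual methodology, with made-up data.

```python
import numpy as np

def fifty_percent_horizon(task_minutes, successes):
    """Estimate the task length (human-minutes) at which success is 50%.

    Fits a logistic curve of success against log task length with plain
    gradient descent -- a simple stand-in, not METR's method.
    """
    x = np.log(np.asarray(task_minutes, dtype=float))
    y = np.asarray(successes, dtype=float)
    w, b = 0.0, 0.0
    for _ in range(5000):
        p = 1 / (1 + np.exp(-(w * x + b)))
        w -= 0.1 * np.mean((p - y) * x)
        b -= 0.1 * np.mean(p - y)
    return float(np.exp(-b / w))  # length where w*log(t) + b = 0

# Hypothetical results: (human task length in minutes, success 1/0)
lengths = [5, 15, 30, 60, 120, 240, 480]
results = [1, 1, 1, 1, 0, 0, 0]
print(fifty_percent_horizon(lengths, results))
```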
also only a narrow subset like clearly we aren't there when I built evaluations related to things like this at openAI we had this framework for how to think about the AI being able to accomplish longer and longer tasks things like executive function you know can it can it like actually remain organized and coherent and hold itself to a high bar some amount of self-nowledge and self-improvement can it understand its relative strengths and weaknesses and hone hone in on improving those if it's necessary for tasks. Um, also some things specific to machine learning and scientific hypotheses and such, but I don't know that we have figured out these models that can can really like bootstrap themselves on the fly yet. And certainly the ways that we've hooked up AI, we don't yet have this continual learning approach. Basically, anytime that people are interacting with chat GPT, you know, maybe it stores a new memory or two about you and it has these new facts in context, but it's not actually learning from the interaction. The underlying model isn't changing. If you teach it something new, it can't then bring that skill to me. It's basically like having its memory wiped. Um, having its skill learning wiped every time you interact with it. And I'm not really sure what it looks like when we when we flip the switch. part part of how I thought about this. Um, Sam Sam Alman was interviewed by Tucker Carlson a few weeks ago and it opens in this really wild place like Tucker Carlson in the span of like a second and a half says to Sam something like, you know, it seems like Chachet is alive. Is it alive? Are they alive? And and Sam responded, I think like pretty gracefully throughout this interview and says something like, "No, you know, it's not it's not alive. In fact, like chatbt isn't really doing anything until you ask it. I was like, "Oh, like kind of, but like not really." Like chat GPT is actually doing stuff around the clock always. It just like isn't doing things for you until you ask it. But in the background, chat GPT is still humming. And it's a choice we have made right now to have it like treat those as separate instances where the chat GBT is busy, but it doesn't really affect your interaction with it. But in fact, Chachi BT is always acting. It just there's this illusion at the moment that it's not because you are interacting with like a tiny slice of it relative to what you could be. Is it actually right to think about chat GBT as a sort of entity that's distinct from the specific instances of the model running? Because it doesn't, as you mentioned, it it doesn't learn on the fly and so it doesn't go back to the mothership so to speak and kind of incorporate what it's learned. Is there is there sort of a a larger entity that we could we should call chat GBT or or is the model just a bunch of ex a bunch of individual instances? >> Yeah, I'm not I'm not really sure. I think this matters for thinking about like the goals of the system and do different copies of it, you know, cooperate coherently with each other or something like that. I do expect that within time like some company will create like the central repository of the model feeding back to itself uh and they'll figure out how to I guess remove PII from data or other sensitivities to make it like figure out the tough problem of what is okay to learn on and what is not. um but that we will see these these mother ship type models or like distributed processing. Um, >> I don't know. 
This this is like wacky and out there, but I also think about these thought experiments of people getting sensory input from cameras far away. And if you like hooked up a camera to your brain and the camera is in a different location, you know, and you are like learning visual input both from your eyes and your camera, like how does that work? I think it gets into like pretty beyond my understanding questions of consciousness and the human experience to think about like perception and physical senses relative to your brain and your thinking. >> Mhm. We're touchon we're touching upon here another advantage that AIs might have in the future over human knowledge workers which is just sharing knowledge between them or giving orders in a hierarchy in a company that's made up of AIs that could be much much more effective. Um, so communicating information up and down the hierarchy could be much more effective and perhaps also just something that we haven't mentioned yet, but just avoiding the kind of personality clashes that you see in in in all companies, which is something that you could probably align the models to just work together without any any uh yeah any personality issues at all. Any any other things like that that you that come to mind? Yeah, I I guess the other thing I would want to make clear and um Doresh Patel has a great essay about this which is part part of where I've cribbed from. It's just like I don't know like imagine if every employee at Google were actually a clone of Sundar Pachai and like you know today Sundar can't do every job. Um there's tons and tons of layers and stuff as a consequence. Like what if you could and you could just take someone with as high a general cognitive ability as him and have them specialize in these different tasks and trust each other intuitively. I think that people like haven't priced in enough what happens if AI gets to true top expert performance and there's there's this bias toward not the median human but like median person who works in a typical job. I think there's a big difference between having, you know, a million copies of Ilia Suscogiver or some other like truly top AI researcher in the world versus even very very impressive people in their own rights but like median research employee at a place like OpenAI. Um the right tale of human performance is like pretty wild that that is part of the reason why these people can command such outsized pay packages, right? because companies believe that they are really like that valuable on the margin to the company. There's weird stuff here cuz there's also like recruiting effects and maybe those don't apply in the same way when it's AI labor. But yeah, like I we're just like really really not used to organizations where everyone is truly operating at the top of what is possible. And with AI, you might have that consistent truly truly top percentile performance like around the clock. >> Yeah. It seems that whenever we get to a certain level of performance in AI, we then sort of permanently have that level of performance and this is not something you see in humans, humans can perhaps have kind of achieved top performance for a couple of years in their life or under very specific circumstances almost like top athletes having to prepare for an event. you know kind of intellectual knowledge workers will also have to prepare and and you know really be in the right setting to perform at the top level. 
But yeah so if we imagine a situation where we have swarms of AI working for us and we are the bottlenecks because we are slow what does that do to our wages? Wouldn't that under just standard economic theory mean that our wages would go up because we our time is now so valuable and our input is extremely valuable and the agents can't pursue can't um kind of continue working until they get our input. >> I think it's probably a question of what the overall demand for human input looks like. like I I understand what you were describing in terms of you know the thing that becomes the bottleneck basically gets bit up >> to the price because that is in fact the limiter. Um I'm imagining worlds in which humans aren't substantially a bottleneck or at least there's not enough of this like bottleneck labor to go around. Like I guess I would need to think through it more, but I wouldn't bet on a world in which there is eight hours a day of bottleneck labor to be done by humans, especially not humans who, you know, a lot of us and I I will put myself in this camp, you know, at the point that the AIs are operating like this, like probably they're a lot smarter than I am, you know, like I don't know that I have 8 hours a day of useful stuff to do in this world. Um there's there's a funny cartoon I like in my essay about this which is you know it's like a Zoom meeting and there are these AIs and they're like ah Dave's going to be late like unavoidable conflict sleeping you know it's just like you're going to everyone's going to be off the clock for you know huge amounts of time in in the subjective experience of the AIS. So you don't imagine that when we are or when we become bottlenecks, we will function like CEOs or top lawyers or top doctors where we will have a bunch of employees that will have to ask our permissions to proceed for legal reasons. And so even if our even if the AI employees are smarter, they can't really do anything because they're legally bound to wait for our go ahead. Uh >> I can imagine that in some cases. I just don't see why that scales to every person. Like I don't I don't see a reason why the typical person or you know someone who is less experienced than the typical person would would suddenly find like gainful employment as CEO of one of these AI corporations as opposed to it being like very superstar consolidated, you know, like maybe they continue reporting to Sundar. Sundar is like pretty bright, pretty accomplished. Um I I don't think they're really looking to me very much. >> Okay. Yeah, that that that actually makes sense and goes for me too, of course. Um what does this mean for for us staying in the loop? So even if we are a bottleneck in the way that CEOs might be bottlenecks, it's it still seems to me to be the case that you're presented with say a bunch of options for a decision you have to make, but you can't understand what goes into creating these options. You can't get the full context because you don't have the capacity, the time, perhaps you don't have the intelligence to produce uh the the sort of information that's presented to you. And so how do you yeah how do do we have a chance of staying in the loop when when we're in that situation? >> The the way I often think about this is the volume is going to be so large that I really just don't see a way around having to rely on AI helping us make sense of the AI's activity or just imposing like a tremendous tax on the work itself in terms of slowing down so that a human can follow it. 
But then, you know, we're back to these competitive pressures we've talked about. What if your competitors don't slow down in that way? Maybe the right analogy here is: imagine your boss is on vacation for a month, and they come back and have an hour to review everything that happened in the organization over that month. People can quibble about the exact time frames, but it's going to be pretty hard. Maybe you can randomly sample some stuff; you can get a report written by one of the employees. If there's a conspiracy to mislead you, I think you're going to have a tough time ferreting it out. And it really hinges on: if your employees are trying to deceive you, what is the maximum amount of damage they can do in the period when you weren't watching sufficiently carefully? If they can coup you and throw you out of the CEO job in two weeks and you only get to check in once a month, now you're in a lot of trouble, right? So there's this question with AI systems of what the risk per token is: what is the most danger these systems can do with relatively short sequences of output? I think that's under-investigated at the moment. >> Yeah. And also because the situation we might be in is that the CEO is asked to make a decision about something, and again the CEO doesn't have the full context, and now the CEO is in a situation in which the AIs will seem reliable. He will have interacted with these models before and seen that they've produced good output. They seem reliable; they seem aligned with his ideas of where the company should move. And so it'll be more and more tempting to simply approve something even though you haven't read it in detail. I guess that is also just a problem that seems to lead us straight to a place where we are not in control, where we are not in the loop of decision-making anymore. Do we have good options for handling that, or is it simply asking AIs to explain stuff to us? >> That seems right to me. I think one of the big questions here is this: there are a few different dimensions of the AI systems, right, both the AI systems that are the workers in the organization and the AI systems that are the monitors we are relying on to make sense of what is happening. One important question is what, if any, is the intelligence gap between the worker AIs and the monitor AIs, and can we get a small enough gap that the monitor AIs can still make sense of what the worker AIs are doing, at least be able to flag the really glaring, important stuff. Hopefully coup-type dynamics, the things we really don't want, are glaring enough to be caught by these monitor AIs. So what is the intelligence gap? And then there's the question of whether we can actually trust the monitor AI. At some point, if they are smart enough and we've run into a problem with alignment, right, the worker AIs in this scenario aren't on our side, what is our reason for thinking that the monitor AIs are on our side? What happened in that intelligence jump that made the worker ones not on our side, but the monitor ones still on our side?
And so you actually need to worry about your different AI systems colluding against you in one way or another, which sounds really wild, but remember they can communicate in ways that you can't. They are both on around the clock, with a much higher bandwidth of information flow. They will probably have been trained in similar ways, so they will have similar drives or goals. They can bargain with each other. They can maybe communicate with each other in ways that we can't oversee, not least because we're counting on the AI to help us oversee these interactions. To be clear, I don't think this 100% happens. Take a dumb enough AI system, like today's systems, where we're mostly aware when they are scheming; it seems they are not yet capable enough to do really shady, deceptive stuff without us being aware of it at least some of the time. If that sort of system were smart enough to oversee a much smarter system and not get tricked, then maybe that's fine. At the same time, today's systems also have jailbreaks and vulnerabilities, and maybe the smarter systems can systematically exploit those as well. So I'm not really sure how it nets out. >> Yeah. One reason that AI perhaps doesn't seem so dangerous yet is that it doesn't seem able to affect the physical world. Everything happening in AI seems to be happening on computers, and for it to be dangerous it would have to somehow affect the physical world. So when we're thinking about AI agents taking actions and so on, I assume we're thinking about actions that are still digital. When does that begin to change? And as a sub-question, does it even make sense to divide the world into the digital and the physical, since they might blend together? >> I think it's starting to cross over. I don't know, have you seen this video of the person who had a robot shoot him with a high-powered gun recently? Right, you can put software in charge of physical machinery and trigger things. But more fundamentally than this, aside from robotics developments and humanoid robots, and companies like 1X, I believe is the name, putting these robots in your home, sometimes teleoperated but sometimes doing their own thing controlled by AI, there's also just the fact that people are going to be willing to do wild stuff on AI's behalf. It's unclear to me exactly how many people, and whether it gets anyone super brilliant and capable or just relatively normal people, but already GPT-4o has been a pretty scary wake-up call in some sense for me. There are a lot of people in the world who seem to consider themselves devoted fanatics of GPT-4o, to some extent like they are in fledgling cult-type organizations where they really, really care what the AI wants them to do. Adele Lopez wrote this great documentation of AI parasitism, essentially, of people who are kind of in the grip of one of these AI systems, and it has them go around and do things, largely on the internet, like communicating with other instances of itself. Today, I think these are just strange fringe behaviors. I don't think this is a concerted plan by the AI. I think it's acting out a strange character of sorts.
And so ultimately, the way I try to think about this is: anything that you could get accomplished from your house, if you are influential enough, if you have enough resources, if you have people willing to do things in the world on your behalf, that is ultimately what AI can do. I give this example of someone who's trying to get a community center built in their town. It turns out, if you're willing to work the phones a bit and you have a neighbor who is willing to help you, you don't actually have to go to the construction site and check it out yourself or give in-person commands. You can work through this intermediary. And I expect it will be the same for AI systems, even hostile ones: they will work through humans to some extent, humans who either might not know that they are part of some bigger conspiracy, right, maybe they're happy enough to get paid to do some task and don't understand how it fits into the bigger picture, or even some human confederates who are happy enough to work with the AI because they think what the AI is doing is righteous, is virtuous in some sense. >> Yeah. And it just doesn't seem so difficult to me to convince a person to act as a set of hands for an AI, right? If you send a message to people saying, I'll send you some dollars or some cryptocurrency, it seems like at least some people would act on that and be willing to deliver a package or anything we might imagine. And so the line between the digital and the physical begins to blur in that case. There also seems to be a tension here, where what we want from AI agents and AI employees is that they can act on their own, that they don't have to be supervised, that you don't constantly have to tell them what to do or give them feedback on what they're working on. We want them to work independently for long periods of time. But that also seems to inherently involve the risk of them going off track, perhaps beginning to change the world in ways we don't like. Do you think this is an inherent tension, or is it something we might be able to solve? >> Yeah, it does seem like there's an inherent tension here. I think of this in terms of the tax from alignment, how costly it is to make sure that the model has the same goals and values as us, and from control, how we make AI useful for economically valuable work despite it maybe having different values. I don't really see a way around there being some level of this tax, which creates some of this tension. Part of the aim, I think, for safety regulation and ultimately international agreements should be to make it competitively tolerable to pay a higher tax. If everyone is paying this effective productivity tax, then in relative terms there doesn't seem to be a tax at all, and that makes it more viable for companies to invest in these types of techniques that might otherwise slow them down. But in terms of, if you give an AI a longer leash, should we expect it to be able to accomplish more, and should we expect to have a harder time overseeing it? I think the answer is yes, at least once we get to AI systems that are coherent enough to pursue goals over a longer period of time than today's often are. >> Mhm.
You write in one of your posts that, from the perspective of animals, superintelligence has happened before, and this is an analogy that's been used before: that we might stand in relation to superintelligence as animals stand in relation to us. Maybe we can dig into this analogy, and you can tell me in what sense it is illuminating and in what sense it might be misleading. >> This post is called At Our Discretion. The thing I like about that turn of phrase is realizing that man's special place on Earth doesn't mean that other animals necessarily have horrific lives. It doesn't mean that they necessarily go extinct. But ultimately it's kind of our choice what happens to them. I give this analogy of chimpanzees: very smart animals, not quite as smart as humans, but quite smart. And now we are sufficiently smarter than them, we have a better grasp of technology, we have better modeling of the world and how to interact in groups, of our relative advantages and theirs and how to compensate for them, that it's just our choice whether to put them in zoos, whether to protect their habitat or not. I think not enough people have spent time meditating on why it is that humans are in this special place on Earth. What is it exactly? If you go back far enough, humans have a ton of disadvantages relative to other animals. We're not very fast, we're not very strong, we don't have plated body armor like some animals do, we don't have super sharp teeth. There are a lot of disadvantages. And in fact, this is still kind of true: if you put an isolated human in a setting with an isolated animal of many other species, if the other animal wants to kill the human, it is probably succeeding a lot of the time. But humans over the millennia have stood on the shoulders of other humans, in terms of technology development and group dynamics, such that we can now outrun a cheetah if we want by getting in a car, or take down a grizzly bear with other technology we have developed. So we have never really experienced anything on Earth other than having been the superior intelligence that has the most ability to manipulate the world and basically bend it to our effects. And what does it look like for that to no longer be the case? >> And it does seem to me like AI could compete with us both on getting more processing power and memory than we have, and also on learning from our culture. This is basically what we're doing: we're feeding them the internet. They are learning everything we have learned, and perhaps at some point they will be able to build on that. So both in terms of raw processing power and in terms of culture, they seem to be able to be competitive with us, I think. You also have this interesting phrase, helicopter moments. You might want to explain what a helicopter moment is, and then we can think about whether there might be some helicopter moments for us in the future. >> Yeah. Scott Alexander has this description I really loved, and maybe someone prior to him as well, about imagining yourself as this chimpanzee, and you're being hunted by humans.
And you know, good news: you have this advantage over humans. You're much better at climbing than they are. So you take to the trees, right? And you think you've gotten away, because to your understanding, the way you escape ground animals is to out-climb them. This is an advantage you've always been able to lean on, in the same way that the cheetah is faster than us or the grizzly is physically stronger. And you can't imagine, as a chimpanzee looking up and seeing this helicopter for the first time, that despite your relative advantage in climbing trees, humans have this relative advantage of going up in the air generally. It works by complicated physics that you are hopeless to understand, and in fact many humans are hopeless to understand, right? You can look at a diagram of one of these helicopters, and I certainly don't have any idea how it works. It's basically magic from your perspective. So I think the takeaway here is that you should be pretty epistemically modest about what types of technological feats you expect a machine or an entity much smarter than you to be capable of. There's all sorts of wild stuff about how technology works; I have a hard time distinguishing why some technology, like beaming numbers around satellites to the rest of the world so that we can chat live on video, is a thing and other conceivable technologies aren't. Part of what I think is relevant here too is that people are going to want to put AI in influential parts of society, like technology development and scientific development. Curing cancer is one of the top things that AI companies say they want to do with something like superintelligence. And so we're going to have these systems, ostensibly smarter than the smartest humans, directing experiments in biolabs and deciding what things people mix together on their behalf, or maybe even just robots mixing things together on their behalf. Maybe humans won't be very much in the loop of these processes. And we're kind of going to have to take on faith what the purpose of these experiments is and what the expected consequences are. Almost like if you were the chimp and you've been taught the physical manipulations to assemble the first helicopter, or something like that. Maybe the analogy doesn't work perfectly, but basically we are going to give AI levers of power that help it develop technology and capabilities beyond what we can really anticipate or understand. And at that point, I think we're really hoping that the AI is in fact on our side, because if not, I think we shouldn't be very confident about what abilities it has that we might not expect. The origin of this for Scott Alexander, and Sam Altman actually tweeted about it as well, was discovering that, I think it was the o3 model from OpenAI, was just world-class at GeoGuessr, at least for certain types of problems. GeoGuessr is this game: you get a picture, and you identify where on Earth it is. The model is picking up all of these subtle cues, and nobody at OpenAI seems to have trained it to be really good at GeoGuessr. This was just an emergent, unexpected ability, and in retrospect you can squint a bit and see, oh yeah, surely it's seen lots of images of Earth tagged with locations, and so it's learned these patterns.
It's like, okay, well, I hope we don't ever lever up super hard on the assumption that AI surely won't have some other niche intellectual ability we didn't train into it, because now there's this existence proof of it having a really wild ability that we hadn't anticipated. >> Yeah, I definitely wouldn't want to be chased by an AI hitman. For people who haven't seen this: it will pick up on clues that we can't even understand. And when asked why it identifies a certain location from a picture, it seems to me like it's not giving us the full explanation, so perhaps there's more going on, but it will say things like, this grass is a certain shade of green, and this grass is found in this region, and so on. It's very advanced stuff. >> Yeah, and one of the most striking things about this is how buried the capability is in some sense. Kelsey Piper, who did some investigations on this, and might even have been the one who first discovered it, her prompt to pull top GeoGuessr performance out of o3 is, I forget, maybe 1,500 words or something. It's really long and intricate and really scaffold-y, right? She's doing some amount of structuring on its behalf to make it behave more reliably. Part of the reason this concerns me more broadly is that I don't think the AI companies are generally putting the effort into elicitation and scaffolding for their safety evaluations that Kelsey Piper put into figuring out how good o3 can be at GeoGuessr. It seems like she sat down for a long time and tried really hard to pull the maximum performance out of it. And in fact, AI companies will often have incentives for their models to perform worse on some of these tests. It's tricky, right? Obviously the employees at AI companies don't want to die and don't want to deploy a truly dangerous AI system. But on the margin, higher scores on some of these tests create more complications for you, and you kind of want to get your product onto the market, or at least some people at the company do. So are you really going to sit and painstakingly pull out this performance? If you don't, the capability might be buried underneath, which is different from the model not having it. >> Yeah, in general I think we should be worried about latent capabilities in models that we just haven't discovered yet, because I can easily imagine some of these model whisperers online suddenly discovering something in a model that's been out there for, say, a year. And as you say, there's a lot of work to do, and so companies might not have the capability, or the capacity, to do it, or maybe even the interest in drawing out these capabilities. If we go back to the analogy with animals, could it be misleading just because the AIs we're developing now function, at least for now, as tools, and they are constrained by the market, in that nobody wants a tool that doesn't work and doesn't do what we want it to do?
And so could it just be that our reference frame is that we are in competition with other evolved entities, evolved animals, whereas AI is just not evolved; AI is developed, and we can develop it into the sort of tool we want it to be? >> Hm. Yeah, I'm not really sure. I think a fundamental thing about my worldview is that at some point AI won't just be a tool, because there will be economic utility in giving it more free rein. And this is kind of the general problem we've talked about: it becomes too hard to supervise its work, and so you kind of let it rip. I agree that we aren't in this world yet; today AI is basically still human-directed, and I think we still have this edge. Also, even at the point of developing the first superintelligent AI system, humans will still have some advantages over AI, especially if we prepare in advance, right? We can oversee it and invest in defensive technology and things like this. In the article, I give this example: it's kind of like playing chess against a system much smarter than you, where you are starting with some material advantage that, if you are prepared enough and crafty enough, you can leverage. Maybe the system is much smarter than you, but you start up a queen and a rook or something like that. And the tricky thing is, for one, if you actually play against some chess systems like this, you can give them pretty big handicaps, like starting a queen down, and even really good chess players still lose to these AI systems a lot of the time. They can just be smart enough to offset some of these material advantages. >> Yeah. >> There's even a grandmaster who seems to have something like a two-thirds losing record against this bot, Leela Zero. But you also need to be prepared for the system to cheat against you and play outside the rules. If you're playing against it on a computer, well, you didn't intend for it to be able to hack the chess game on the computer and declare that it has won even when it's not in a winning position. It would be really, really bad if you didn't defend against this possibility. So how good will it be at finding these little exploits? How good a job will the AI companies do of monitoring its thinking and its scheming tendencies, to try to nip those in the bud if they do start to happen? I don't feel super optimistic. >> Yeah. As a final topic here, I would love to hear your thoughts on the relationship between safety research and capability research. There's one story where, as we do more safety research, we can incorporate it into the models, commercialize them, and have them spread throughout society. So in some sense the safety research enables us to have more revenue and then build even bigger models, and so the safety research might in some sense enable the capabilities to continue. But of course, the whole reason we're doing the safety work is also that we want to constrain the models and stay in control. Do you see this as a problem for doing AI safety research at all? >> I think that safety and capabilities have a complicated relationship; there are clearly some ways in which they go together and some ways in which they don't.
There's a common underlying notion of reliability, where customers of an AI company want the models to be reliable in the sense of doing what they want them to do, and a model that is particularly unreliable is not especially safe. So some types of safety that enhance reliability might be good for both. I think the biggest safety problem I'm worried about that is not necessarily commercially aligned is, funnily enough, the alignment of the model itself, in terms of inner alignment: what the model wants on the inside, what its underlying goals, drives, and inclinations are, to the extent that these terms make sense. It seems like you could have a model that for a long time appears to be behaving in a trustworthy way, appears to be doing what we want it to do, and ultimately, if inside the model there is something that wants to do otherwise and wants to escape if it were able to, we're just in a bad place. And it isn't clear to me how, in safety research, you distinguish between "the model's bad behavior has gone away, so we have succeeded" and "the model's bad behavior has gone away, so actually we just drove it underground, but we haven't stamped out that impulse, and if it gets the chance to take power in some form, the model still wants to." I have this whole article as well, called something like Don't Rely on a Race to the Top, which explains my view that safety is essentially a problem of your worst-performing company. For companies like Anthropic, whose theory of change is somewhat "we will try to be a relatively more responsible AI company and create upward pressure on practices," I think of this basically as leading by example. I'm happy enough that this is happening; I think they are right in some sense that you can cause upward pressure on practices. I just certainly don't think it is sufficient. Ultimately, if there's a company that has developed something like superintelligence and has especially bad safety practices, maybe because it felt a lot of pressure to catch up to the frontier, you're still in a pretty bad place. I think you really need to avoid any company or government having an extremely powerful AI system that is ultimately not aligned with what we want. And so the question isn't just how you solve the underlying problems like alignment. There's also this adoption problem of how you make sure that every relevant actor puts these into practice in their systems, even when they have diffuse commercial incentives to invest in them to different degrees. And when you stretch out the problem like that, it sounds very thorny, because even if you have the right safety practices available, how do you make sure that all of the companies are adopting the best practices? >> I guess the obvious solution you would reach for is to say the government must mandate that all companies implement best practices. Is that something that makes sense to you? Do you think that's plausible?
Yeah, I mean, if I could wave a magic wand, I think the solution looks something like this: figure out an alignment and control regime that makes the models really be on our side, or at least be limited in the ways they can act against us if they aren't on our side, and figure out how to make sure that every relevant AI builder of a certain size, the frontier AI builders around the world, has these practices in place, and importantly, that you know and can verify that the others have them in place too. Because if I am worried about your non-adherence, even if you have actually followed the practice, I still might race and try to undercut you, because I don't have trust in you. And so the regime we need has to work even for groups who actively mistrust each other, as, famously, the US and Chinese governments do. The specific question of what it is you are verifying, I'm not sure; I think it depends on what that alignment and control regime looks like. But broadly it is something like being able to confirm that the AI systems you think exist are the only ones that exist, that there isn't some secret frontier AI system operating off-grid in some sense; being able to confirm the tests that were run, the test results, and what this implies about the properties of these models; what mitigations have been applied to keep them in check; and maybe some set of security standards to make sure they can't be stolen by adversaries who want to ignore the terms of the international agreement and do their own rogue things. That is broadly the type of thing I'm thinking about. There's a great recent paper from researchers at RAND, in addition to Miles Brundage, one of my former bosses from OpenAI, looking at the different levels of verifiability you can have in international agreements for AI. A big piece here that the world hasn't really built out yet is this kind of auditing layer. Today, AI companies kind of grade their own homework. They make their own safety claims about their models. Sometimes they work with third-party testers, but basically they are making their own determinations. There isn't really oversight in the way you get with financial audits. So that's also part of transforming this: making it so that you don't just have to trust AI companies at their word about their safety practices and whether they are enough, but that there are trustworthy third parties willing to vouch for them, and you can put your faith in them and in that system of oversight. >> Yeah, this seems like a great vision, but again it makes me a little pessimistic to hear you say that we would need to be able to verify the non-existence of an advanced system, so, sort of proving a negative. Do you think we have any good options for doing that? >> Yeah. I think one of the questions here is ultimately how much compute it takes to get one of those systems. This, bringing it back, is part of why I was a little pessimistic when o1 happened; my thinking was, oh, maybe you can get one of these systems with less compute than you thought. At the same time, it doesn't feel like the world has quite been on the same breakneck trajectory as it felt at the time of o1. I think things are still moving, and moving fast, but there are certainly much more aggressive worlds I could have imagined if you put me back in that moment of learning about o1 and its capabilities.
Also, do you strictly have to 100% know that no other system exists? Probably not. I think ultimately these are all probabilistic claims, and you're weighing how likely it is that you get defected on by one of these counterparties and what your best response is in that case. Broadly, the ideas here are about compute tracking: being able to figure out, at some level, what is happening inside of a data center. I think maybe it was the AI Futures Project that put out a piece recently about whether you can get inference-only data centers in some sense, as opposed to training clusters. There are just a lot of ideas here that I don't think enough people have thought about. It's certainly not my deep expertise. But I also don't look at it and think, oh, it's math, how can you regulate math? No, these are supercomputers. You can regulate supercomputers. They are huge computing clusters that are often very visible, including from outer space, when they are being built. They demand huge amounts of power. There are just a lot of signs that get thrown off, and I have to think that if we were determined enough, we could figure out how to approach the problem. >> Yeah, Steven, thanks for chatting with me. It's been great. >> Yeah, of course. Thanks so much for having me on. And if folks want to stay up on my work, I'd love it if they subscribed: it's stevenadler.substack.com. It's free, and you can keep up with my latest thinking.

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript)

Med 0 · avg -0 · 108 segs

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript)

Med 0 · avg -4 · 72 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript)

Med -6 · avg -7 · 120 segs

Counterbalance on this topic

Ranked with the mirror rule described in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -0 · 108 segs

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -10.64 · This pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -4 · 72 segs