Future of Life Institute Podcast · Civilisational risk and strategy · Featured pick

How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann)

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation examines core AI safety through "How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann)", surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 85 full-transcript segments: median 0 · mean −3 · spread −21–5 (p10–p90 −10–0) · 2% risk-forward, 98% mixed, 0% opportunity-forward slices.

Slice bands
85 slices · p10–p90 −10–0

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes catastrophe
  • Full transcript scored in 85 sequential slices (median slice 0).

Editor note

Anchor episode for the AI Safety Map: high signal, durable framing, and immediate relevance to leadership decisions.

ai-safety · fli · core-safety · technical


Episode transcript

YouTube captions (auto or uploaded) · video 27uxAIQLj-k · stored Apr 2, 2026 · 2,178 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/how-to-avoid-two-ai-catastrophes-domination-and-chaos-with-nora-ammann.json when you have a listen-based summary.

I think the next two to four years will be pretty path-defining. I think we need to build workflows, tooling, and infrastructure that allow us to trust the outputs a system creates without needing to fully, blindly trust the system. The crucial question is how we can arrive at justified confidence in those outputs, and if we find ways to do that scalably, then I think we're in a good position to build a world that unlocks a lot of flourishing. We will get massive uplift from AI in all of these domains: science, engineering, decision-making. I think the big question is whether we can channel enough of that newly unleashed creative power into building the sort of things that maintain stability throughout. I'd rather get all the flourishing things out of not superintelligent systems, but highly capable systems that I can coordinate with well and that can coordinate with each other well, as opposed to training successor agents that we don't know how to train. There's no perfect security; it doesn't exist.

>> Nora, welcome to the Future of Life Institute podcast.

>> Thank you.

>> Fantastic. Could you introduce yourself to the audience?

>> Yeah, sure. My name is Nora Ammann. I currently work as a technical specialist at the Advanced Research and Invention Agency (ARIA) in the UK, and I've been broadly in the AI safety world, the "how are we going to make AI go well" world, for the last couple of years, working on a few different things; we'll probably touch on a few of those.

>> Perfect. So you wrote to me that you think the strategic picture is that we are in a slow-takeoff, multipolar world. If we take that statement apart, what does it mean that we're in a slow-takeoff world? How do you measure that? What evidence do you see for it?

>> Yeah, maybe just clarifying what exactly I mean: it comes in degrees, but essentially I expect there to be several inflection points of AI progress, rather than one big one after which we have self-improving AI going through the roof. So instead of one inflection point, I think we are seeing several, and at each one there's a pickup in the speed of acceleration itself. That will still be fast, but in the lingo of the broader space we would refer to that as slow takeoff rather than a "foom" kind of takeoff.

And what evidence do I think we see? Well, we see these models improving, but I don't think we see them improving increasingly fast from one training run to the next, or achieving full generality, especially online learning. So that's what we see so far. And then there's another aspect beyond model capabilities: how those capabilities translate into things that happen in the world, like economic impacts and adoption. People sometimes talk about "unhobbling" here: actually making use of those capabilities is taking time, people are figuring out how best to get the capabilities out of these systems and adopt them into the world, and there are various bottlenecks to this rather than a single one.
So even if we unblock one bottleneck, there are a few others. And while this will keep picking up, there's no singular inflection point.

>> Yeah, that makes sense. In terms of years, what does it mean that we're in a slow-takeoff world? Because "slow takeoff" can be somewhat misleading.

>> Yeah. Obviously there's uncertainty, for sure. My best guess, or median guess, is that by the end of 2026 we'll see pretty intense uplift in software engineering in particular. I'm thinking especially of the METR work on the task horizons that AI systems are able to pursue autonomously. That task horizon seems to be doubling every four to seven months, and if you follow that trend, you will have AI systems that can do software engineering tasks of day-long length: what would take a human a day to do, they would be able to automate by roughly fall or the end of 2026. That would give a strong uplift to software engineering, and to ML research to some extent. That will get us to find new algorithms for training, and eventually these AI systems will be able to do R&D on hardware as well. So I expect it to go relatively quickly after that: by 2028-2030, I think these systems will be capable of a pretty wide range of autonomous R&D. Definitions are tricky, but I think more in terms of the window of intervention we have to steer the direction of that takeoff, and I think that window is mainly located within the next two to four years. Whether we call it AGI or ASI, I don't care too much; the question is when we have time to intervene, and I think the next two to four years will be pretty path-defining.
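Editor's note: a quick arithmetic check of the task-horizon claim above. The sketch below assumes a starting autonomous-task horizon of about one hour as of mid-2025 (an illustrative figure, not one given in the episode) and an eight-hour working day, then applies the four-to-seven-month doubling time Ammann cites from the METR work.

    from math import log2

    start_horizon_hours = 1.0    # assumed current horizon (illustrative)
    target_horizon_hours = 8.0   # "what would take a human a day": one workday

    doublings = log2(target_horizon_hours / start_horizon_hours)  # 3 doublings
    for months_per_doubling in (4, 7):
        print(f"{doublings * months_per_doubling:.0f} months at a "
              f"{months_per_doubling}-month doubling time")
    # 12 to 21 months from mid-2025, i.e. roughly late 2026 to early 2027,
    # which is consistent with the "end of 2026" median guess.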
>> And why do you think that?

>> I often think of this as differential technological progress. There's the question of how fast this will go, but I'm also interested in what direction it takes off in. If these systems become increasingly capable of autonomous R&D, and autonomous work more generally, it will be those systems doing more and more of it, and they can do it much quicker and in parallel; there can be big fleets of these systems doing this. So the volume of what's happening will increase, there will be a speed-up, and it will be harder for humans to intervene at that point, unless we've already built good structures that allow us to do so. Setting this up on the right trajectory, given how fast the speed-up is, seems to lie roughly in that time period.

>> I'm assuming that we will get more capable too. As humans, we'll have these AI tools that enable us to do more work, and if that happens, would it extend the window we have to intervene, just because we would also get more powerful as the AI becomes more powerful?

>> Yes. And that's exactly the sort of differential acceleration I think is critical to get underway over the next few years. If we in fact use the next few years well, then it will be true that human-AI teams will be collectively very capable. If we don't do that well, I think there might be a heavier shift toward AIs doing that work autonomously, without us having meaningful oversight, steering, or involvement.

>> And you think the path of minimal oversight, of not much ability to steer the systems, is what happens by default?

>> It sure seems like a lot of that happens naturally. Though it's always tricky; I'm hesitating a little because we're all actors in the world creating the future, so saying that's my default prediction depends on what I'm conditioning on. Mostly what I'm saying is that it seems really, really valuable to put a lot of effort right now into making these systems more steerable and building scalable oversight mechanisms.

>> You also wrote to me that we are on a narrow path where we can fail in two different ways: we could fail through domination or through chaos. Maybe you could sketch out what those two failure modes look like.

>> Yeah. Those are a bit terms of art; let me unpack them, and I'm not sure they're the best words at the end of the day. When I think about this future, where capabilities keep improving and AI uplift on a bunch of things we want to do becomes increasingly significant, the big question for me is: how do we make that go well, and what would it even look like for it to go well? When I think that through, there are two clusters of scenarios I'm worried about, and importantly, those clusters break into many, many scenarios.

One is the more classical rogue-ASI, loss-of-control scenarios: AI capabilities keep improving, and eventually we have an AI system, or a collection of AI systems, that is more capable and more powerful than humans, and misaligned, whatever exactly that means; out of control and indifferent to humanity in a way that could cause a lot of harm. That would be a scenario where the harm comes from a single system accumulating a lot of power and dominating other actors, who maybe also have their interests. So that's what I'd call the cluster of failure through domination. There are nuances to this. Some people are interested in AI-enabled coups, for example: what if the AI system is quite aligned with a human individual, or a small group of individuals, but those individuals stage a coup for power, so that they effectively hold all the power and dominate the rest of the world? That too would be a bad outcome, and it also falls into this gestalt of failure through domination. So that's one cluster where I'm like: we've got to be careful about that sort of thing.

>> Mhm.

>> And then the other one that stands out for me is a bit more diffuse. Let's assume there isn't going to be a single system, or a single set of actors, that accumulates most of the power, enough to dominate everyone else.
By domination here, I mean something like: powerful enough that you can just impose your will on others. So let's assume there's no single actor, or set of actors, that can do that; instead we're maybe in a world where everyone is very similarly capable, but we're somehow locked into a race to the bottom that is very adversarial, where everyone is trying to win out over the others. What you might end up with is a setup of excessive decentralization and excessive competition, such that every individual agent, because they can't coordinate, is locally always incentivized to spend their last extra resources on increasing their fitness, their power, their defenses, so much so that, as a collective outcome, no one can spend any of their resources on anything else they might care about. So we get an erosion of all values other than winning this competition, and that also seems bad. And it seems like an opposite cluster, because it looks like there's no single dominating power, but rather all-out competition eroding away what we care about.

So in this picture, with these two clusters, I'm like: oof, I can tell stories about how AI is going to accelerate both of them, and neither sounds that exciting. What does it look like to scoot past both? How do we thread the needle so that we don't end up in either of those worlds? And what would a world look like that stably doesn't collapse into one or the other of those clusters?

>> Yeah. The scenario with excessive competition is one you write about with co-authors in "Gradual Disempowerment", a paper I'll link in the description; listeners can also scroll back in the feed and listen to my interview with David Duvenaud on that topic. But just quickly on the competition question: competition seems to be a positive force in the world, or has been historically. So what is it that's changing with AI that makes it into a destructive force?

>> I agree with the premise: competition can definitely be constructive in many ways. But to put it succinctly, I think the challenge with competition is basically how you deal collectively with negative externalities.
Negative externalities being effects of your actions, individual or collective, that are bad, but that no one is pricing in; the incentives are such that no individual actor prices them into their actions, so they get ignored. A classical example is pollution: we price its downstream costs in insufficiently, everyone is worse off as a result, and yet it's hard to coordinate on not doing it. And I think excessive technological risk is also a negative externality that is hard for a group of actors to coordinate on managing appropriately. There's a lot of upside we can get out of, for example, the technological development of AI, but there are risks, and how do we collectively coordinate on managing those risks? So these are examples of negative externalities, and my picture here is: if you're in a context of all-out competition, where for whatever reason the various actors in this multipolar world can't coordinate, those negative externalities will pile up, and there's no way to coordinate around them not happening.

And in the world we have, where [snorts] we're in some sense so powerful: I think the moment humanity invented nuclear, um, nuclear [clears throat] power was a big deal, because it was maybe the first time humanity was able to create negative externalities so big that we could honestly, properly not recover from them, which is very different from externalities that are recoverable.

>> You're thinking of nuclear weapons, correct?

>> Yes. Yeah.

>> Yeah.

>> So if you have negative externalities that you can't recover from, then as a collective you really want to coordinate on not incurring those externalities. But if you don't have the means of coordinating on that, you collectively can't steer away from them.
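Editor's note: the coordination failure Ammann describes can be made concrete with a minimal two-actor game; the payoff numbers below are invented for illustration. Each actor chooses whether to pay a private cost to manage a shared risk; ignoring the externality is individually best whatever the other does, yet mutual ignoring is collectively worst.

    payoffs = {  # (my_move, their_move) -> my payoff (illustrative numbers)
        ("manage", "manage"): 3,
        ("manage", "ignore"): 0,
        ("ignore", "manage"): 4,
        ("ignore", "ignore"): 1,
    }

    for theirs in ("manage", "ignore"):
        best = max(("manage", "ignore"), key=lambda mine: payoffs[(mine, theirs)])
        print(f"if the other actor plays {theirs!r}, my best reply is {best!r}")
    # Both best replies are 'ignore', so uncoordinated play lands on payoff
    # (1, 1) even though (3, 3) was available: the externality goes unpriced.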
>> Yeah. And one worry here would also be that the societies or companies or individuals that thrive in a world with a lot of competition are exactly those that are set up to be maximally competitive, and to not care about anything except getting more resources and winning. So you end up in a world where everything is optimized for gathering the most resources and using those resources to find further resources, and so on, with not much left over for, say, human flourishing.

>> Yeah, exactly.

>> So one question I have here: is there a tragic tension between avoiding these two outcomes? Is it the case that trying to prevent one of them works to the detriment of trying to prevent the other?

>> Well, it sure seems tricky from where we're standing. And this is why I use this notion of a narrow path, of threading a needle. It's in some sense interesting to observe that these failure modes seem like quite the polar opposites of each other, and at the same time, from where I'm standing right now, with partial observability over the world and what's going on and what will happen, I'm still worried about both of them. Both of them seem like somewhat serious, plausible outcomes, even if they sit at polar ends of the spectrum of what could happen. So that's interesting. And I'd just say I think it's unhelpful to frame it as: which of those bad outcomes is the better one? The problem, stated correctly, is that we need to navigate both of those challenges, and then ask what that looks like, rather than trying to make a hierarchy of which one is worse. So yeah, I'm worried about both of them, I think we need to look out for both of them, and threading the needle is in fact hard.

>> Yeah. So let's get into your proposed solution for threading the needle here, which is around building coalitions, specifically coalitions between humans and AIs. What is it that we need from those coalitions? What sort of capabilities would they have?

>> There are a few different things. Okay, here is one way in: AI is happening. [snorts]

>> Mhm.

>> AI is improving. We are finding more ways of feeding it into the things we do and getting benefit out of that. People are very excited about AI-enabled science: what if AI can really supercharge how we do science? We already have a lot of examples of how it supercharges software engineering and coding, and that can increasingly feed into other areas of engineering and R&D. We're thinking about how AI can help us navigate a complex world and make good decisions. And partially I'm like: that seems to be happening, so how do we make it go well? What does good even look like here?

I think one of the pieces we need is to be able to leverage AI systems even though we maybe can't, or shouldn't, trust a system 100%. And there are a lot of reasons why that might be the case. A bunch of people listening to this podcast might not want to trust AI systems full-on because of safety concerns, about deception, about misalignment; seems legit, seems important. But there are also more mundane reasons why we might not want to trust AI systems as they get better. They still make mistakes, or it can actually be hard to tell a system precisely what you want. I might say, "Oh, build that website for me," and it does, and it tries really hard, and maybe is really well-intended in trying, but I didn't actually specify what security properties I want from the back end, et cetera. So we might just be ambiguous about what we want, which makes it hard for the system to do what we want. Or there are cases where I individually might trust what the system does, but as a collective, as a society, it matters that we all feel we have justified trust, and that's not the same problem. So there are a few different reasons why we can't just close our eyes and trust that the AI will do all the things we want in science and engineering.
So instead, I think we need to build oversight: workflows, tooling, and infrastructure that allow us to trust the outputs a system creates without needing to fully, blindly trust the system itself. If we get AI to start doing science for us, to start doing engineering for us, the crucial question is how we can arrive at justified confidence in those outputs. If we find ways to do that scalably, then I think we are in a good position to build a world that unlocks a lot of flourishing, in this coalition-y way, where AIs and us can work together on building cool things.

Then the question is what sort of things we should be building, and I guess a lot of things; the lightcone is pretty big, there's a lot that could be built. But going back to what we talked about earlier, the notion of negative externalities: one of the things we need to make sure of, so that we can continue to expand into the lightcone and do cool and fun things, is that we manage negative externalities, in particular the big ones. Some of the stuff needed for that is public goods. We need to build resilient infrastructure, such that, yes, accidents might occur, yes, there might even be adversarial dynamics going on, but we are resilient enough that we can recover from them, work with them, or adapt to them, and we need to build the R&D that allows us to do that. And on the other hand, I think we also need to build out the infrastructure that enables positive-sum trade: agents working together can unlock more than they could do alone, which is really exciting, and that's where the possibility for coordination lies. But there are challenges to actually entering these trades; this is economic theory, bargaining theory. There are challenges even in finding these opportunities to collaborate, and there are security- or enforcement-related risks. Say you have something I want and I have something you want: great, let's just do it. But what if I have to worry that after I give you my thing, midway through the trade, you decide to change your mind and walk away? If I face that risk, my expected-value estimate of whether I should engage in this trade becomes very different, and it becomes an obstacle to entering what would have been a positive-sum trade. So building out these public goods, which allow various coalitions of actors to emerge and find positive-sum interactions, would let us navigate a rich world, and navigate the negative externalities on both the domination side and the chaos, or excessive-competition, side.

>> Yeah. So this vision starts with us being able to trust AIs that are not fully trustworthy, so that we can build these coalitions between us and AIs. How is that work coming along? I know you've been working on it, and others have too. Where are we, and could you perhaps explain how this might work?

>> Yeah.
There are many ways in, and I want to say there are a lot of folks working on aspects of this problem, and a lot of different ways you might achieve this, so I'm not going to name them all, but just to acknowledge that. There's classical alignment work, which frontier companies and other places are doing, that can feed into this. There's interpretability work, where we're trying to better understand what AI systems are and how they behave, and that can help here as well. But the thing I'm working on I often summarize, at the moment, as trying to build the tooling that allows us to do high-assurance AI-enabled R&D. By high assurance I mean: have AI do a lot of R&D for us, software, hardware, various types of R&D, but in a way where it can give us really pretty rigorous, quantitative assurances about the outputs of that process. That's what the Safeguarded AI programme that I work on at ARIA is doing, or aiming at, to some extent. We are a bit more than a year into it, and it's by design meant to be very ambitious; we'll see how it goes. So far I'm excited about what's happening, and I think we'll know much more by around next summer about whether some of the initial big bets are panning out.

>> Let me just try to give an example of how I see this problem. We can imagine that we have a superhuman coder AI in a couple of years, and it gives us a piece of code that's way too complex for us to understand, and the AI tells us that this code is very helpful to the projects we're trying to achieve. Should we run this code? Is running this code dangerous? This picture is a little naive, but the question is: should we run this piece of code? One thing you [clears throat] can do, and this is my explanation of what you're working on, is to have the AI provide assurances about what will happen when you run the code. You can explain exactly how that's done, but the desired end state would be that you can say with confidence: when I run this piece of code, such-and-such will happen, and I will get this outcome in the world. Is that an accurate sketch of what's happening?

>> Roughly, though maybe it's slightly too ambitious. If it's literally "here are all the things that will happen when you run this code," that's not exactly right, or could be misunderstood. Maybe the way I would put it is: we want our AI to produce some code for us, and there are specific properties we want from that code. Say we want to use this code in critical infrastructure, so we're really interested in the code not having exploitable bugs. Now, one way we could do this is by asking our AI agent to write that code, telling it to write it securely, and then maybe later on trying to review the code and see whether we find any bugs. But the problem with that is, first, that it's actually pretty hard to review code and find all the bugs that are in it.
And secondly, it's not scalable, right? If we really buy into the thesis that AI is going to do more and more of that work, and can do it much faster, then in the future AI will produce volumes of code that humans could never keep up with reviewing. So that doesn't seem like a very good bet. The thesis of the programme is: what if instead we made the AI do some extra work for us, to show us, in a way that is much more easily reviewable, that the code actually has the property we want? In this case, specifically, we're interested in the AI providing proofs that the code it gives us meets particular specifications that we have given it. To make that scalable, we need to build some tooling that, first, makes it easy for us to specify what we want; that's not an easy task, but if we have good tooling here we can use it across the board. And then we need to give the AI additional affordances to get good at giving us proofs that the outputs it produces do in fact meet the specifications given. So that would be an example of scalable oversight in this sense.

>> And so you're asking the AIs to do more work. What's your sense of the performance tax you're paying for that, versus just asking the AI to produce the code and then crossing your fingers and hoping it works? I guess that might be cheaper, but what's the performance difference between those two things?

>> It will vary with what you do. A proof about a piece of software will be much cheaper than a proof about some cyber-physical control system, for example, because that's a much more complex world model and specification to prove against. So it will vary, but the main point I want to make, boldly, is that it doesn't matter that much. Because if you're the principal, a human, a company, whatever, and you have these AIs that can really uplift how much you can do, but they end up doing something and you have no idea what it is, and you can't keep up with making sense of what it is, effectively that's just not useful at all. Not a little bit useful: basically not useful. So there's a bunch of natural incentives here: yes, obviously I want the work to be done well, but it is also very important that we build the infrastructure and the tooling that makes using these workflows easy and scalable. And this is why I think of this as a public good we can provide, one that would change the trajectory of how we use AI systems going forward. There are more details to be said here; you can get pretty nitty-gritty about how scaling and compute costs pan out, and that remains to be seen.
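Editor's note: a minimal sketch of the "proof instead of review" workflow described above, using the Z3 SMT solver (pip install z3-solver). The candidate function and spec are invented for illustration; ARIA's actual tooling is far more elaborate. The point is the shape of the interaction: a (hypothetically AI-written) implementation is checked against a human-written formal spec over all possible inputs, rather than reviewed line by line.

    from z3 import BitVec, Solver, And, Not, unsat

    a, b = BitVec("a", 32), BitVec("b", 32)

    # Candidate (say, AI-written) midpoint that avoids the (a + b) / 2
    # signed-overflow bug. Z3's '/' and comparisons on BitVecs are signed.
    mid = a + (b - a) / 2

    # Human-written spec: for in-range inputs, the midpoint lies between a and b.
    precondition = And(a >= 0, a <= b)
    spec = And(mid >= a, mid <= b)

    s = Solver()
    s.add(precondition, Not(spec))  # ask the solver for a counterexample
    assert s.check() == unsat       # unsat: no counterexample exists, a proof
    print("verified: the midpoint meets its spec for every in-range input pair")

The reviewable artifact is the spec plus the solver's verdict, which stays the same size no matter how much code the AI produces; that is the scalability argument being made here.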
But one advantage of this scalable-oversight infrastructure we're building out is that it also makes it easier to use fleets of agents in parallel, in a way that actually works well. Because if you want to parallelize your work, you need good ways of carving up that work, and you need good ways for these agents to coordinate with each other without being able to communicate freely, or without having all the shared knowledge.

>> And why can't they just communicate freely and share knowledge? Because that would be dangerous to you, or why is that?

>> So you need to build harnesses that do that. You need to figure out how they would be sharing knowledge, and they could in some ways, but you still need to carve up the work. If you want to work on two things in parallel: what if one finishes before the other? What if one agent tried X and it failed; does that mean my plan doesn't work out anymore, and I have to re-plan? How do you carve up your technical roadmap such that you can effectively, efficiently parallelize? That's an actual problem, and it's not that easy. And then there's another scaling law here: I could use the same amount of compute to run one model that's bigger and more competent, or I could use 20 agents in parallel, and depending on how effectively I can use them in parallel, one or the other might give you better outcomes. So there's an interesting scaling law going on here, and we'll see how exactly it pans out. But given some recent data I've seen on scaling, it plausibly means that it keeps you relatively competitive to be able to parallelize across agents very well.
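Editor's note: a toy version of the scaling trade-off just described: one larger model versus N smaller agents on the same compute budget. The capability curve and the coordination-efficiency exponent are pure assumptions for illustration; nothing here comes from the episode or from real scaling data.

    def one_big_model(compute: float) -> float:
        return compute ** 0.6          # assumed capability-from-compute curve

    def parallel_agents(compute: float, n: int, coord_eff: float) -> float:
        per_agent = (compute / n) ** 0.6
        return per_agent * n ** coord_eff  # coord_eff < 1: coordination overhead

    budget = 100.0
    print(f"one big model: {one_big_model(budget):.1f}")
    for eff in (0.5, 0.8, 0.95):
        print(f"20 agents, coordination efficiency {eff}: "
              f"{parallel_agents(budget, 20, eff):.1f}")
    # With poor coordination the fleet falls behind the single model; with good
    # coordination it overtakes it, which is why harnesses that make
    # parallelization work well can keep fleets of agents competitive.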
>> Interesting. Does ARIA have working prototypes of any of these things? Do you have a toy example of how this looks?

>> There are different parts to the entire toolset, so there's not an overarching single working prototype. There are definitely examples: folks are using workflows that look a bit like combining coding agents with SMT solvers to get proofs about code, and we have some early prototypes in the world-modeling domain. But we're still definitely early on, and I can't point you at a nice website where you can play around. Maybe by summer we'll have more that looks like that.

>> So, making agents collaborate: this seems like it would be a breakthrough in capabilities as well. Are you playing with fire here, in a sense? Is this something you think the frontier companies are also trying to solve? And why is it important from a safety perspective to make sure that agents can collaborate?

>> I guess one way of describing this is that we will get massive uplift from AI in all of these domains: science, engineering, decision-making. I think the big question is whether we can channel enough of that newly unleashed creative power, that will be unleashed into the world, into building the sort of things that maintain stability throughout; that build resilience quickly enough, because enabled science will also find dual-use discoveries. I think, basically, in order to make things go well here, we need to build a bunch of technological and institutional solutions that help us navigate those various trade-offs. So in part this is a bet: how can we differentially, and first, accelerate the building of those resilience- or stability-increasing technologies fast enough that we can create and maintain those islands, those coalitions of positive-sum trade, and protect some of the things we care about?

>> So if we're successful in making AI agents collaborate with each other on solving a problem, does that help us build these human-AI coalitions? Because collaborating with an AI agent is somewhat the same thing as collaborating with a human. Is that the angle you're taking here?

>> They're partially similar: some of the tooling needed here is the same, and then there are also slightly different things. We want that human-AI link to be sound, and separately, I think we'll be in a world with plenty of AIs floating around the economy, and most of the economic activity that's going to be happening will be AI agents doing it, often representing humans, or representing companies, in some way. So they need some technical solutions to be able to bargain with each other, and to bargain in our stead. You know, I might have some agent that will be scheduling all the meetings for me in the future, that understands something about what my preferences are, and that also sometimes comes back to me and asks, "how do you feel about the scheduling?" So we need that human-AI link, and we also need those agents to be able to go out and coordinate with each other. Sorry, now I lost your question, though.

>> I'm thinking of where the safety angle is in making AI agents collaborate. It could also be that having agents collaborate to solve a problem is safer, because the individual agents are not as smart, but they can produce an outcome that is collectively high quality, and in a way where we have more insight into what's happening: it's not one incredibly smart agent just giving us an output, it's a hierarchy that we might be able to look into.

>> That's right. And then maybe an additional consideration here is something like this: I think we face serious risks, or negative externalities as we've been calling them, from building ever more capable AI systems, or systems that really self-improve, because we don't know how to do that in a way that maintains alignment reliably.
We don't have the science of how to safely self-improve your own source code. So I think that's a risk, and we need to navigate it. Now, interestingly, an AI system that is pretty capable, maybe capable of ML engineering, such that it could train the next iteration of more capable agents, is in some sense also facing an ambitious alignment problem: how should it design or train the next successor agent in a way that protects the key things it cares about? There's some world where we've discovered the science of doing that, and then we know how to do it, though [clears throat] I'm not quite sure whether metaphysically that's entirely coherent, or what it would mean. But if we don't, then that agent might face a choice: between training a successor agent that it can't control or align with high confidence, or alternatively, if it has the option, using tooling and infrastructure and coalitions and teams to become more capable as a collective, through tooling rather than through self-improvement. That might become more attractive. So we're maybe facing that situation: I'd rather get all the good things, all the flourishing things, out of not superintelligent but highly capable systems that I can coordinate with well and that can coordinate with each other well. And those AI systems themselves might also rather build flourishing coalitions that can get us a lot of exciting things, as opposed to training successor agents that we don't know how to train well. Until we do; maybe at some point we do, I don't know. So there's some gesturing here at where some of that stability might come from.

>> Mhm. And these are the types of things that you work on: scaffolding, tooling, infrastructure. Could you give some examples? What is tooling, what is scaffolding, what is infrastructure, and why does this allow agents to improve in a perhaps safer way?

>> There are a couple of different bits. One is what we refer to as world modeling. This is basically: how do you specify what you want? In software this is somewhat easier: you can write certain memory-safety properties, say, and specify them formally relatively easily. Once you're thinking about a cyber-physical system, specifying what you want requires you to have some sort of model of the world, some causal structure of how the world works. So one bit of tooling is: create the tooling that allows us to specify what we want, and do that in a scalable way. That's the world-modeling part. Then there's another bit of tooling which is more about: once you have a spec, how do you scalably take that spec and produce a solution to the problem that was specified, alongside a proof certificate that asserts with high confidence that the solution actually meets those specifications? How do we make that easily parallelizable, how do we make that go well? Giving AIs the tooling to do that is another thing we want to do.
>> Yeah. Before we move on, what does this concretely look like? These are pieces of software; should listeners think of these tools as something like a Word or Excel program? [laughter] I'm just trying to understand exactly what's going on. Is it something on the screen? Are you inputting your specified values into a text box? What does this look like concretely?

>> For the AI agents, the term "harnesses" is used a bunch these days, in terms of what we can give to an AI agent to help it, in some sense, be more capable, do more things: harnesses that guide them, that allow them to be coherent over longer time horizons, and to use other tools, whether that's searching for things on the internet or using an SMT solver or something like that. So I would broadly think of all of this as building fancy harnesses that allow AI systems to help us build world models, and then write outputs and proofs about them. And then there are some elements of: which bits here are really critical for humans to understand and review? In this case, humans should be reviewing the specs, the place where we tell the agents what we want, and against which we're getting guarantees. So there are UX/UI questions about how we best do that, so that humans can co-create these specifications.

>> So how is the spec different from a prompt? If you're just a regular user of AI, you'll prompt the system and try to get it to do something for you. What you're doing with a spec is more advanced than that; it's not expressed in pure text. So what is a spec, and how is it different from a prompt?

>> The thing we're aiming at is formal specifications. You can just think of it as mathematics: you want to mathematically specify what you want. This could be some sort of temporal logic: here are some states in the state space that we don't want to end up in. And then what you ask the system for is some sort of assurance that, while optimizing for the goal and trying to do as well as possible, we are guaranteed not to end up in those parts of the state space that we have specified. So the specs can be the safety specs, where you say: these are all the things we don't want to happen. And then on top of that you can also specify the goal: what's my utility, what do I care about?
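Editor's note: a sketch of what such a formal safety spec could look like, written in linear temporal logic; the grid-frequency example and all symbols are illustrative, not drawn from the Safeguarded AI programme's actual spec language.

    % "Never end up in the bad states": a safety property in LTL.
    % Bad is illustrative: grid frequency f drifting out of its tolerance band.
    \varphi_{\mathrm{safe}} \;=\; \mathbf{G}\,\neg\,\mathrm{Bad},
    \qquad
    \mathrm{Bad} \;\equiv\; \bigl(\,|f - 50\,\mathrm{Hz}| > 0.5\,\mathrm{Hz}\,\bigr)

    % The proof obligation returned with a proposed solution M is then
    %   M \models \varphi_{\mathrm{safe}},
    % i.e. every trajectory of the world model satisfies "globally not Bad",
    % while a separate objective (the utility) is optimized on top.

Unlike a prompt, which the model is free to satisfy loosely, a spec like this has a precise satisfaction relation: a proposed controller either provably satisfies it over the world model or it does not.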
>> Yeah. And when you say guarantee, this is stronger than a legal guarantee; this is a mathematical guarantee that something won't happen.

>> Yes. There's some nuance here: "guarantee" can actually mean a bunch of different things. It can be fully logical: I have a proof that states, 100%, that this thing is not going to happen. What we're dealing with more, especially in cyber-physical systems, is guarantees about likelihood: the likelihood of this happening is less than some value, and you try to make that value pretty small. You could even get into asymptotic guarantees, where you say that in the limit of spending effort, or spending compute, on this, it will converge to some value. So guarantees come in a range, but I would typically just say: let's talk about quantitative guarantees as the relevant category; a higher degree of rigor than the AI system just telling you, "oh yeah, I did check the code, in fact," or stuff like that.

>> And wouldn't we see the same types of problems we see when humans try to give guarantees? There's a bunch of examples in history of people saying "this can't happen" or "this is vanishingly unlikely," and then it happens anyway, perhaps because we were overconfident, perhaps because we didn't have an accurate model of the world. To what extent would AI face those same problems?

>> I think in principle we face the same sorts of problems; there's no perfect security, it doesn't exist. But the way I think about this is that we have to adopt the security mindset about where these errors tend to creep in, and then, with a layered-defense approach, try to reduce them. A few places errors can come in: you've already mentioned this, we can have a wrong model of the world, and if we have a guarantee relative to that model of the world, and the model turns out to be wrong, then our guarantee isn't giving us what we thought it would. Now, how can we layer additional defenses onto this? One is that we can try to make the world model, or the specification, iteratively better. Before we deploy the solution, can we test adversarially first, can we come up with scenarios where it turns out the spec was not what we wanted it to be? We can do that a bunch. We can also have runtime verification, meaning: we have a system that was synthesized to have certain guarantees against the world model; let's deploy it, but as it's operating, let's observe what actually happens in the real world. If we start to observe things that conflict with what the model said were plausible trajectories, that is evidence that something in our world model was wrong: we made the wrong assumptions, and our guarantees don't hold anymore. And then you want a fail-safe structure, which says: oops, that guarantee wasn't what I thought it was; let's back up to something that's much more conservative. So that would be an additional layer of safety.
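Editor's note: a minimal sketch of the runtime-verification and fail-safe layer just described. The state shape, the plausibility bound, and both controllers are hypothetical; the point is the structure: act on the verified, optimized policy while observations match the world model, and back off to a conservative policy the moment they do not.

    def plausible(state: dict) -> bool:
        # World-model assumption (illustrative): demand never jumps more
        # than 20% between control ticks.
        return abs(state["demand_delta"]) <= 0.20

    def optimized_controller(state: dict) -> str:
        return "optimized action"      # policy verified against the world model

    def conservative_controller(state: dict) -> str:
        return "conservative action"   # simple fail-safe, broadly safe

    def step(state: dict) -> str:
        if plausible(state):
            return optimized_controller(state)
        # Observation conflicts with the model, so its guarantees no longer
        # apply: fall back to the conservative layer.
        return conservative_controller(state)

    print(step({"demand_delta": 0.05}))  # -> optimized action
    print(step({"demand_delta": 0.90}))  # -> conservative action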
And exactly how many layers are appropriate, and how exactly you want to do this, should depend on your context: if I code up some personal website, I care about it less than if we're trying to secure the energy grid and make sure that no one can hack it. So it will be contextual, and it should be layered, because no single approach gives you perfect safety. But the types of errors are pretty similar.

>> Yeah. I [clears throat] can understand how you can have safety guarantees and accurate models of very simple systems, very simple programs, say. But as soon as you start interacting with the real physical world, it seems like you would face some sort of explosion of complexity that means you can't accurately model what's going on. If we're talking about a piece of code written by AI that's supposed to be plugged into the energy grid, that's a very, very complex system, and one that's both physical and digital. How do you handle that? How is the project of trying to model different parts of the world going?

>> The goal we're pursuing is dual. At the heart of it, we're trying to build out the capability of doing these sorts of things at scale and well, and we're also trying to find ways of checking and demonstrating, whether on particular examples or otherwise, that it works. And in principle, what we are giving assurances over is, say, in the energy grid case, the control system that does the energy-grid balancing: how much supply is coming in, how much demand is going out, how do we make sure enough energy is at the right place at the right time, such that overall the property of stability of the energy grid is maintained? This system is not the whole world; it's that control system. Right now a lot of this is done manually, because it's a pretty high-stakes thing, but doing it manually means it's not very optimized, and we keep a lot of slack to make sure we're not having blackouts, et cetera, plus a lot of safety layers. And the proposition is basically to say: AI in some sense quite obviously lends itself to this problem, because it's essentially an optimization problem, but we don't want to just optimize with classical methods, because they don't generalize well out of distribution, and the system might be seeing out-of-distribution occurrences.

>> Yeah. And that's almost guaranteed, right? In the real world you will see all kinds of one-in-a-million things happen.

>> Right, right. So now you want a world model that's in some sense quite conservative, one that doesn't very narrowly say, "historically, this is exactly the demand pattern we had, so it should look roughly like that." You want to instead say much more conservative things: what are the physical laws that govern how fast energy can travel here, what frequencies does that imply, and what constraints should there be?
So that's a set of differential equations that govern the physics of the grid. Then you understand what control knobs you have available, and then you feed in some amount of what you know about what demand and supply tend to be, and that's where you optimize. But you build in enough conservatism that even under relatively intense spikes or surprises, you could still maintain the frequency that will keep the energy grid stable. And then there are societal questions here about how much buffer exactly we want: are we okay with a blackout once in a hundred years, once in twenty years, once in a thousand years? And how does that trade off against the cost efficiencies we get? But once you have a formal mathematical world model that specifies all the plausible trajectories your system can take, not just the probable ones, then at least you can make that question, and that trade-off, more explicit. You can say: okay, we'll save this much money annually, and we're willing to have a blackout once in a hundred years; that is a collective deliberation, a social question.
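Editor's note: the difference between fitting history and a conservative world model, in miniature. All constants below are invented; the idea is that the controller is checked against the worst demand spike the model allows, not just against the demand patterns seen so far.

    FREQ_TOLERANCE_HZ = 0.5   # how far frequency may drift from nominal
    CONTROLLER_GAIN = 0.8     # assumed fraction of a spike the controller absorbs

    def frequency_deviation(demand_spike: float) -> float:
        # Toy plant model: unabsorbed demand shows up as frequency deviation.
        return demand_spike * (1 - CONTROLLER_GAIN)

    historical_worst_spike = 0.6   # what the historical data happens to contain
    model_worst_spike = 2.0        # conservative bound from physical reasoning

    for label, spike in (("historical worst", historical_worst_spike),
                         ("model worst-case", model_worst_spike)):
        print(f"{label}: deviation {frequency_deviation(spike):.2f} Hz")

    # The guarantee is stated against the conservative bound, so it still
    # holds for out-of-distribution events up to that bound.
    assert abs(frequency_deviation(model_worst_spike)) <= FREQ_TOLERANCE_HZ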
>> Yeah. And then we would be much better informed about those questions than we currently are, about the trade-offs between cost, performance, and reliability in the systems we rely on. Again, there's the question of how much additional cost is imposed by having to do this world modeling and these guarantees, as opposed to just building, as we do now, a grid without these things. Do you have a sense of what the additional cost is, and would you think it's higher when you're interacting with a physical system, as opposed to a system that's only digital?

>> Doing it for a cyber-physical system is more complex than for a purely digital system, so it will be higher in some sense. But I think the question is what the different considerations pulling in either direction are. One thing we see is that AI adoption, especially in these high-stakes domains, has actually been quite slow, and is really tricky. There are a lot of governments that would be very interested in increasing the efficiency of power-grid balancing, for example, because the costs are really immense, and the same goes for a bunch of other systems, but the adoption problem is just pretty significant. So if this way of doing things, which looks like it has more initial cost, is the only way you get to the level of confidence at which we as a society are actually happy to adopt these systems into high-stakes environments, then actually that cost isn't meaningful at all. So that could be one consideration. Then, obviously, society right now has a tricky time pricing in all the catastrophes that didn't happen, and again, maybe AI-enabled world-modeling and collective sense-making tech can help us make those trade-offs better, because then we have more meaningful guesses about what these numbers actually are; that again would be a way of finding positive-sum trades. And lastly, one thing worth flagging is that a lot of this looks like upfront cost that will actually amortize relatively quickly later on. Once you have a world model of a specific energy grid, you don't need to redo it every time; you might update it over time, et cetera, but you have that set of specifications, and it can be reused. That's true in a bunch of different domains, so you might also see some sort of amortization effect happening.
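Editor's note: the amortization point, made concrete with assumed numbers; the costs are arbitrary units chosen purely for illustration.

    upfront_world_model = 100.0   # one-time cost of building the world model
    verified_per_use = 1.0        # extra per-deployment verification cost
    unverified_per_use = 0.5      # cost of the unverified alternative

    for uses in (1, 10, 100):
        effective = upfront_world_model / uses + verified_per_use
        print(f"{uses:>3} uses: {effective:6.1f} per use "
              f"(vs {unverified_per_use} unverified)")
    # The per-use premium collapses as the same world model and specs are
    # reused, which is the amortization effect described above.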
>> Yeah. Let's talk about AI resilience, as a way of broadening out. We've talked about the energy grid from a bird's-eye view. What do we need in order for the future with advanced AI to be more resilient?
>> Yeah, I think we need a bunch of things, which is also why we need good AI-enabled R&D processes that help us build those things faster. But just zooming out a little to set this up: the reason I like the term resilience is that it doesn't mean nothing bad ever happens. It means nothing irrecoverably bad ever happens. We would like to keep playing this game. We'd like to maintain the core functions that make society run the way it currently does, or that make a human body run the way it does; a healthy system is a functioning system. We want to maintain those core functions, and then something can go wrong, and we can adapt to it and recover; within some bounds, that's fine. But we want to be really resilient against the outlier hazards. The interesting thing here is to say: evidently humanity has been resilient so far, because we're still here and functioning, so we have built out resilience in various ways. But given the advent of AI, a lot will change, including what it looks like for a civilization to be resilient; the risk profile is changing. So there's this exciting premise: let's think through what resilience would look like in the age of AI. If you try to do this in a relatively structured way, you can ask: what are the key functions we need, and what are the key attack vectors, where do we have vulnerabilities that we need to make more resilient? You can try to list them. We have cyber systems that a lot of our civilization runs on, from the energy grid to the internet to the financial system. We have physical systems that we need as well. We have institutions, socioeconomic institutions, that we need to work in certain ways.
We have human psychology, human cognition, and also collective sensemaking and collective coordination. AI is introducing new dynamics in all of these domains, and then there's this interesting task of strategic foresight: working out how exactly AI will change these domains, and what we need to do so that they remain within the bounds of good, healthy functioning.
>> Yeah. I think what the COVID-19 pandemic showed is that we, as a world, have not been optimizing for resilience. It showed, for example, that in manufacturing there had been a lot of optimizing to reduce storage costs, with just-in-time manufacturing, which meant we couldn't manufacture many of the things we needed during the pandemic. The same goes for where we manufacture the critical things we need during a pandemic. Is this the kind of example we should have in mind when we think about resilience? It's something that could have been foreseen, it's not historically unprecedented, and it broke a bunch of the systems we rely on to function in the world.
>> Yeah, totally. And it's worth thinking about this quite holistically. Yes, in a sense we didn't have stockpiles we should have had, or we hadn't streamlined certain vaccine approval processes as much as we could have. So we could build some of that; we could now stockpile things. But there's a meta-problem here, which is that, as societies, just after a catastrophe we say, 'cool, we're willing to pay this extra cost,' and then a couple of years in we say, 'why are we paying this again? We should just stop.' So there's also a collective sensemaking question: how do we coordinate on managing these negative externalities? That in itself could be a way to strengthen resilience. So there's lots to do, which is kind of exciting. I've done some work just trying to map this out at a relatively high level. In biosecurity I'm not an expert, and I'm always very impressed by the experts we do have and their analyses of what the highest-leverage interventions would be. In cyber it looks different again. With the advent of AI systems that are really capable of coding, combined with formal methods, there's the promise of writing code, especially in high-stakes domains, that formally meets certain specifications, getting rid of the exploits in the first place rather than just investing in finding and patching them quicker. That's the promise of making the system very defense-favored, and I think that's the really exciting thing we should be aiming for, and we should be very ambitious about getting there as soon as possible.
>> What would be examples of systems that favor the defense?
>> Well, I think of defense-favored as: how can we build out the sociotechnical stack such that civilization is as defense-favored as possible?
I like the example of formally verified code to illustrate roughly what I mean by defense-favored. I mean that if you pour more resources into both offense and defense, defense will systematically outcompete. That's much better than a situation where you need to put in two, three, four, five times as much into defense just to keep up. So as much as possible, if we can set up the structure such that a little extra input into defense lets you defend against a much larger offensive resource input, that's really interesting. And then, for each of the critical domains, we should try to envision how we could get society into a defense-favored position there. I think that could be very stabilizing as we go through what is, on the face of it, a potentially pretty destabilizing AI transition.
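One crude way to sharpen the distinction drawn above: call a domain defense-favored when the defense spend needed to neutralize a given attack budget grows more slowly than that budget, and offense-favored when keeping up costs a multiple of it. The two cost curves below are invented placeholders, there only to make the comparison concrete.

```python
# Toy illustration of defense-favored vs offense-favored regimes.
# The cost curves are invented placeholders, not empirical estimates.

def defense_needed_favored(attack_budget: float) -> float:
    # Defense-favored: neutralizing an attack costs a fraction of its budget.
    return 0.3 * attack_budget

def defense_needed_unfavored(attack_budget: float) -> float:
    # Offense-favored: keeping up costs a multiple of the attack budget.
    return 4.0 * attack_budget

for attack in (10, 100, 1_000):
    print(f"attack budget {attack:>5}: "
          f"favored defense needs {defense_needed_favored(attack):>6.0f}, "
          f"unfavored needs {defense_needed_unfavored(attack):>7.0f}")
```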
>> One example might be preventing cyberattacks, where you could have AI agents on both sides of that equation, one trying to defend a system and one trying to attack it. Are there any good options here for making it such that the defenders have an advantage?
>> Yeah. You can write code that just doesn't have exploits, so there's nothing for an attacker to do. Now, this isn't as easy as that might have sounded, right? There are side-channel attacks; there's the case where you thought you didn't have a bug and you still had one. But there is, in principle, a way to write code that doesn't have exploitable bugs, and we have some history with side-channel attacks and so on to draw on in order to build systems that are highly secure. Among all the domains, I think cyber might be the most easily defense-favored.
>> This is interesting, because when I've interviewed IT security experts, it's often the case that they're depressed or pessimistic about ever creating safe systems, because it's so difficult to find all the ways in which these systems could fail or be attacked. Is this something that's changing with AI? Are we getting better at creating secure code in a way that scales to systems that are actually useful, rather than systems that are secure but can't do much?
>> Yes, I think it is. A few things to say here. One of the key points is that we have examples of very secure code. One example I like to give is DARPA's HACMS program from a number of years back. One of the things that came out of that program is a formally verified microkernel, seL4, which they tested in pretty wild scenarios. I think they used those systems in a manned helicopter and a quadcopter, and they had red teamers who knew everything about the development process: they had full access to how the system and the software were built. They were also given access to the running system, maybe to the music-playing system or the camera, something like that, from the start, and from that position they were told: now try to hack the rest of this helicopter and take it down. But they didn't succeed, because seL4 was able to give an isolation boundary around that subsystem that they couldn't break out of, because formal verification methods had been used to get those guarantees. So we have examples of highly secure code. Now, you might ask: why are we not using that all the time, if it's so good? The reason is that it takes a lot of human effort to write code in this highly secure way. But a lot of that effort is something AI will be very good at, and models are actually already starting to be very good at using systems like Lean to prove properties. So the argument here is that there wasn't as much adoption as you might have expected because it was very expensive, but that cost is coming down. There will be other bottlenecks: getting the specs right is a bottleneck; you don't prove that you have the correct specs, you need to figure that out and iterate on it. But the answer to that is that we should invest a lot in figuring out how to do specs well; that's how we should do software in the future. And maybe the last thing I want to say here is that the word 'safe' can mean a lot of things. If you say, 'well, this system is completely safe,' it's underspecified what you mean. Ideally we're much more precise. For example, a seL4 system gives you this particular property of strong isolation. That in itself is a component that makes it easier to reason about the properties of your composite system, about whether it's safe given what that means in the context. So one might be a bit allergic to claims like 'it will just make everything safe'; I'd say that's just being imprecise. But we can build R&D artifacts that have very specific properties, and those properties can be security properties and so on, which stack up in useful ways.
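The seL4 guarantees described above come from machine-checked proofs that cover all possible behaviors, which can't be reproduced in a few lines. As a deliberately weaker stand-in that shows the same workflow (state a precise property, then check it mechanically), here's a property-based test using the `hypothesis` library. The `redact` function and the secrecy property are invented examples, not anything from the HACMS program.

```python
# A much weaker cousin of formal verification: property-based testing.
# Where a Lean or Isabelle proof covers *all* inputs, this only samples many,
# but the workflow is similar: state a precise property, check it mechanically.
# The `redact` function and its property are invented for illustration.
from hypothesis import given, strategies as st

SECRET = "hunter2"

def redact(text: str) -> str:
    """Replace every occurrence of SECRET in the text."""
    return text.replace(SECRET, "[redacted]")

@given(st.text(), st.text())
def test_no_secret_survives(prefix: str, suffix: str):
    # Property: however the secret is embedded, the output never contains it.
    assert SECRET not in redact(prefix + SECRET + suffix)

test_no_secret_survives()
```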
>> So the specification is important, and it's important that we invest resources into becoming better at writing these specifications. Is it dangerous to have AI help us write them? I'm thinking here that the AI model we're working with has some set of values, and that set of values might compete with, or pollute, the values we're trying to write into the specification. So if we're partially automating or delegating this work, maybe we end up with a specification we're not fully in agreement with.
>> Yeah. We definitely need to be very thoughtful about that part, and we definitely should not just close our eyes and let the AI specify things. But I do think there are several things we can do to collaborate with AI, including on writing specs, in a way we can still feel good about. One thing to say is that the current generation of frontier AI systems is, in some pragmatic way, pretty well aligned. They seem pretty friendly to me: if you try to get them to do nasty things, they tend not to want to do them. Obviously there are ways to jailbreak them, and they're clearly not perfectly aligned, but they're pretty friendly; I'm using 'friendly' in the sense that Claude and ChatGPT and so on are generally trying to get at what I'm saying. So that's already a good collaborator, and they're already pretty capable; even just these systems give us uplift in writing specs. The second point is that we need to write specs in such a way that humans can review them. This is why we want to write them down formally and very precisely, so there's a precise definition of what we want. If you had a purely natural-language spec and said, 'my specification is that this thing should be safe,' that's just imprecise; it's not a good spec. So we need to write specs, maybe with the help of AI systems, in a way that's human-auditable. And then there's so much entrepreneurial and creative energy in the world, and one great problem to try to solve is: how can we engineer great workflows for humans and AI systems to write specs at scale, in a way that lets us trust those specs? So that's a few of the answers.
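As a toy illustration of the difference between a vague spec and a precise, human-auditable one: below, the requirement 'the controller keeps grid frequency safe' is written as two executable predicates that a reviewer can read line by line and a test harness or verifier can evaluate. The signal names and thresholds are invented for illustration.

```python
# Hypothetical sketch: a precise, human-auditable spec as executable predicates,
# in contrast to the natural-language spec "the controller should be safe".
# The signal names and thresholds are invented for illustration.

NOMINAL_HZ = 50.0
TOLERANCE_HZ = 0.5       # frequency must stay within nominal +/- tolerance
RECOVERY_STEPS = 10      # and return to a tight band within this many steps

def spec_frequency_safe(trajectory: list[float]) -> bool:
    """Every sampled frequency stays inside the hard safety envelope."""
    return all(abs(f - NOMINAL_HZ) <= TOLERANCE_HZ for f in trajectory)

def spec_recovers(trajectory: list[float]) -> bool:
    """After any excursion beyond 0.2 Hz, frequency re-enters the band within RECOVERY_STEPS."""
    for i, f in enumerate(trajectory):
        if abs(f - NOMINAL_HZ) > 0.2:
            window = trajectory[i : i + RECOVERY_STEPS]
            if not any(abs(g - NOMINAL_HZ) <= 0.2 for g in window):
                return False
    return True

# A reviewer can audit these predicates directly; a verifier or test harness
# can then check a candidate controller against them.
print(spec_frequency_safe([50.0, 50.1, 49.9]), spec_recovers([50.0, 50.3, 50.05]))
```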
>> So let's talk about Coasean bargaining, and how it might be supercharged with advanced AI. Maybe you can explain the concept, and then we can talk about how AI might help us actually implement it.
>> Cool. Yeah. Coasean bargaining comes from economic work that basically asks: what would it look like for multiple agents to coordinate on reducing externalities? The typical example is two neighbors, where one of them likes to play loud music and the other says, 'no, this is actually kind of costly to me, I don't like this.' The question is: under what conditions can they coordinate on a solution that's good, and pretty close to optimal, for both of them, without an external power, like a state, just enforcing some rule? Sometimes this works and sometimes it doesn't, and you can look into why it is that sometimes we don't manage to coordinate on reducing negative externalities, or to bargain over them in a way that's good for everyone. Often this is attributed to transaction costs, broadly: there are costs to directly bargaining with each other to find good solutions, and those costs make what could have been an efficient market inefficient. Sometimes this is broken down into information costs, bargaining costs, and enforcement costs. So if, say, I give you a pair of shoes and you give me a can of milk, and we're both happy because we each get relatively more utility from that trade, it matters that I can trust that the trade occurs the way we thought it would. If I really can't trust that, it's harder to trade. And we can think of society as having had many years of iterating on institutions that make finding those positive-sum trades, and engaging in them, easier. For example, legal norms, with, at the end of the day, a state that enforces them: I could take you to court if we both signed an agreement and you're not keeping your word. That enforcement sets the incentives, via the threat of punishment, so that it's better for the agents to keep their word. We also have much softer things, like norms: maybe we both know a shared set of people, and it would actually be costly for one of us to just renege. So humanity has built very rich layers of institutions that help us coordinate on positive-sum trades.
>> But there are still many of these trades that fall outside of that, trades that are too costly for us to implement. Say, for example, that I'm playing loud music, we live in the same apartment complex, and I play loud music every second Sunday. This annoys you slightly, but not a lot. You would, if you could, send me, say, $10 to make me stop, if you could be sure that I would actually stop playing the music. Another cost might be that I perceive it as weird that you want to pay me money to stop playing music. All of these are frictions in the way of a trade we would both like to make. So this brings us to AI. How do you envision AI helping us make more of these positive-sum trades?
>> I mean, there are obviously a bunch of ways, but an intuitive starting point is that my AI could go out and negotiate with other AIs representing other people, say my neighbor with the loud music and my seven other neighbors who also do things I [laughter] would like to coordinate on, in parallel, scalably, without me needing to invest the time. So that's one: we could just do many more of the types of trades that I could in principle do myself but can't do all at the same time. But there are even more exciting opportunities. Sometimes trades can't happen because it's not incentive-aligned to share the information that would let us notice there's a positive-sum trade available. I might not want to leak certain information, but if we could both mutually know that we're actually both interested in the exchange, we would want to do it.
So you could imagine, for example, an AI agent that represents me and knows these things, and an AI agent that represents the other person and knows their private information, and the agents could exchange that information with each other, especially if we find ways to enforce strong enough information-security standards, without either of the humans ever learning it. Now what used to be a trade that wasn't accessible to us, because we didn't want to share that information, becomes accessible, because we can share the information without actually sharing it. So that's another example in which trades that used to be inaccessible become accessible. And maybe the third category is that an agent could, if we build the tools and the ways to do this, make much more credible commitments, or credible commitments in contexts where I as a human couldn't. You could literally build a system that can't do anything other than the thing it's claiming to do, and I can show you the code of that system, and you can confirm that the system does exactly that thing and nothing else. That's a very credible commitment, and again, it might unlock positive-sum trades that otherwise wouldn't have been accessible.
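A minimal sketch of that private-information point, under strong simplifying assumptions: a mediator (standing in for the pair of trusted AI agents, or for a secure multi-party computation) sees both sides' private valuations and reveals only whether a deal exists and at what price. The names, numbers, and the split-the-difference pricing rule are all invented for illustration.

```python
# Minimal sketch of bargaining without revealing private information.
# The mediator stands in for mutually trusted AI agents (or a secure
# multi-party computation); all names, numbers, and the split-the-difference
# pricing rule are invented for illustration.

def mediate(max_willing_to_pay: float, min_willing_to_accept: float) -> float | None:
    """Return a price if a positive-sum deal exists, else None.

    Only the verdict leaks: neither party learns the other's valuation.
    """
    if max_willing_to_pay >= min_willing_to_accept:
        return (max_willing_to_pay + min_willing_to_accept) / 2
    return None

# Neighbor A would pay up to $10/month for quiet Sundays (private).
# Neighbor B would accept $6/month to stop playing music (private).
deal = mediate(max_willing_to_pay=10.0, min_willing_to_accept=6.0)
print(f"deal at ${deal:.2f}/month" if deal is not None else "no deal")
```

The credible-commitment point is the same move one level up: if both neighbors can inspect, or have their agents verify, that the mediator's code does exactly this and nothing else, the verdict itself becomes trustworthy.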
>> Amazing. All right, do you want to point listeners in any directions at the end of this episode? We've talked about a bunch of topics, and listeners who are still with us might be very interested in them. Are there any places they should look for more information? I will link everything in the description.
>> Cool. Yeah, there are a few. They can check out the ARIA website; there are a few different programmes, and the Safeguarded AI programme is doing some of the work we discussed. There's another opportunity space, called Trust Everything Everywhere, that's doing some of this Coasean-bargaining-related multi-agent work we mentioned. So there are some interesting resources there. There's also a website, ai-resilience.net, that I worked on together with my collaborator Eddie, where we started to think through what resilience would look like in the age of AI: both the exciting endgames, where we've worked out good ways of making ourselves more defense-favored, and the R&D priorities that seem promising to work on. And for folks interested in the Coasean bargaining story, there's a cool article by Seb Krier called 'Coasean Bargaining at Scale', or something close to that in the title. So that's a good collection.
>> Great. Final question: we've been talking a lot about AI, and listeners to this podcast are very interested in AI. Could you recommend a book, an article, a movie, anything that's not about AI but is relevant to the world we're entering?
>> There's a bunch of them. I guess I'll go with recency bias. I think Seeing Like a State, by, I think C.S. Lewis is the author's name? It's a great book that deals with some of those themes of excessive centralization versus excessive decentralization, and with navigating the political-philosophy and political-economy questions of how we can coordinate with each other: the role of the state, the ways the state does some really helpful things to help us coordinate, and the ways it does some somewhat harmful things. It's a really interesting book.
>> It's the book from the late 90s, right?
>> Yep.
>> Yeah. It's by James C. Scott.
>> James C. Scott. Wow, I said that really wrong. James Scott, yeah.
>> Any other recommendations?
>> Well, I did like the book called Underground Empire. Yeah, there we go: Underground Empire: How America Weaponized the World Economy. There are certainly some interesting geopolitical insights there. I found it a very fascinating book. It's about how the infrastructure of the modern financial world gave the US, especially in the twentieth century, a lot of soft power over the world, because so much of finance ran through the US. So that was really interesting.
>> Great. Nora, thanks for chatting with me. It's been really interesting.
>> Cool. Thank you so much.

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -0 · 108 segs

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -4 · 72 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med -6 · avg -7 · 120 segs

Counterbalance on this topic

Ranked with the mirror rule described in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -14.44 · This pick -10.64 · Δ +3.80
This page · This pick

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -0 · 108 segs

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -14.44 · This pick -10.64 · Δ +3.80
This page · This pick

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -14.44 · This pick -10.64 · Δ +3.80
This page · This pick

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript)

Med 0 · avg -4 · 72 segs