
AXRP · Civilisational risk and strategy

Samuel Albanie on DeepMind's AGI Safety Approach

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation examines core safety questions through Samuel Albanie's discussion of DeepMind's AGI safety approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 72 full-transcript segments: median 0 · mean -4 · spread -315 (p10–p90 -160) · 7% risk-forward, 93% mixed, 0% opportunity-forward slices.

Slice bands
72 slices · p10–p90 -160

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 72 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · core-safety · technical

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video y_CFJBR9TyQ · stored Apr 2, 2026 · 1,942 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/samuel-albanie-on-deepminds-agi-safety-approach.json when you have a listen-based summary.

Hello everybody. In this episode, I'll be speaking with Samuel Albanie, a research scientist at Google DeepMind, who was previously an assistant professor working on computer vision. The description of this episode has links and timestamps for your enjoyment, and a transcript is available at axrp.net. Two more things. You can support the podcast at patreon.com/axrpodcast and also you can very briefly tell me what you think of this episode at axrp.fyi. All right. Well, uh, welcome to the podcast, Samuel. It's a pleasure to be here. Cool. So, um, today we're going to be talking about, uh, this paper, An Approach to Technical AGI Safety and Security. It's by a bunch of authors, but the first one is Rohin Shah, and you are somewhere in the middle of this list. And can you tell us just like what is this paper about? Sure. So the goal of this paper is to lay out a technical research agenda for addressing some of the severe risks that we think might be posed by AGI. I think one thing that kind of struck me when I was reading the paper is not well well okay it's pretty long um and so there there are some things in there that surprised me but like by and large it all seemed like pretty mostly pretty normal stuff right like if if you've been around the AI safety landscape I think like a lot of these things don't seem like super shocking or surprising. So I'm wondering like what's the I don't know like in some sense what's the point of it? Didn't we already know all of this stuff? Yeah, that's a great point and perhaps it would be a great sign of maturity of the field to the degree that when describing our plans there were no signs of novelty there. Uh but in many cases I think the goal of this sort of work is to lay out the approach and also try to expose it to critiques both internally within the company but also to describe the justification for certain choices and elicit comments on them, that sort of thing. Okay. And and did you find that like um so have like like as you were writing it, did you find that um that process caused like you know you realized that you should be doing something different or like you already found some internal critiques? I guess it's like as we're recording this it's just being released so it's a little bit early for external critiques. Um well actually no you probably have received some external critiques but um it's it's early for like thought out you know um you know high quality external critiques but I'm wondering like yeah has it already changed plans or like caused uh people to think differently about things. Um yeah that's a great question. So I think at least from my perspective it has been very useful to work through some assumptions that I think were implicit in how we were approaching research and then try to drill down and say okay what really is the evidence base for this and perhaps more importantly under which circumstances do we need to throw some of this stuff away and change some of the assumptions that are underpinning our approach. So uh one of the assumptions maybe we'll get into perhaps the one that is most uh I don't think nuanced is the right word but has some complexity to it is this idea of approximate continuity. Yeah, that progress in AI capabilities is is somehow smooth with respect to some of its inputs.
And um it's sort of one thing to say this loosely in conversation and to drive research from it, but it's it was helpful at least for me to work through the arguments and think a little bit about okay to what degree do I find these plausible where are sources of uncertainty here? And uh I think I think there's a lot of value in that that exercise. Fair enough. Um and my understanding is that uh the assumption section is the part that you were sort of that you kind of did most of your contributions to in the paper. Is that is that right? I think that's a fair characterization. Yeah. Okay. Um so in that case maybe it would be good to just like start off just talking about what these assumptions are and um what you what their implications are if that's okay. Sure. Yeah. Yeah. So we can just blow through one by one or if you want to pick one uh to refer us into I I think like one by one like in the order in the paper seems good. So the first one is like current paradigm continuation um which is yeah how maybe I should let you say yeah how would you characterize just what that assumption is? Yes. So the the key idea here is that we are anticipating and planning that frontier AI development at least in the near-term future looks very similar to what we've seen so far. And if I was to characterize that I would say it's drawing on these ideas from perhaps Moravec, the fundamental role of computation in the improvements in AI capabilities, uh the ideas of someone like Rich Sutton in his bitter lesson that a lot of a lot of the progress is being driven by some combination of foundational techniques like learning and search and that because we have seen significant progress so far within this paradigm though I accept there are some differences of opinion on that but I'll leave that aside for a second um and because it's highly plausible that those inputs will continue to grow that it's a reasonable bet when we're thinking about our research portfolio to make a pretty strong assumption that we're going to maintain in this regime. Okay. And and what Yeah. I'm wondering like so the term regime or the term paradigm like it can be a little bit loose. So how right? Yeah. How how what's not in the paradigm? Is that a is that Yeah. Or maybe like suppose we stopped using transformers. Would that count as would that violate this assumption? It would not. It's relatively loosely scoped here. So um roughly because uh we're thinking a lot in terms of the inputs to the process. So methods in in some ways I think of transformers as being quite a natural convolution, excuse me, not convolution, continuation of convolutional neural networks in a sort of a loosening of the inductive biases to make more general use of computation. And so if there were to be further steps in that direction, uh that would plausibly still fit at least in my mind in terms of how we're baking the assumptions into the plan very much still within the current paradigm. Whereas to take something that would be you know not inside the paradigm something like um brain uploads from Robin Hanson or something for which learning did not play a pivotal role in the acquisition of new capabilities? Okay. Um is the use of gradient descent a crucial part of the paradigm as you see it? Um that is a great question. So I think in terms of our so yeah maybe we can scope this out a little. So um do do you think in terms of alternatives do you have in mind something like evolutionary search or basically a a method that does not make any use of gradients?
Yeah, I think I mean I don't have any particular thing in mind although yeah um I mean things I could imagine are like evolutionary search um I could imagine like maybe we move to using these like uh hypernetworks instead of these like object level networks you could imagine uh which I don't know I guess probably that would be another you could do that with evolutionary search or you could do that something else you could imagine we start doing like A* search with these heuristics that like um you Maybe we um Yeah. How do the Okay. Okay. As as I say this, I'm realizing that evolutionary search is like the one the one like non-gradient descenty thing that I can think of. Um you could imagine like iterative like like suppose we start doing like various like Monte Carlo thing sampling things. You could imagine that, you know, being iterative updates that are not quite gradient descent as we understand them. But yeah, I guess I'm like I'm not totally sure I have an alternative in mind. I just want to I'd like to understand like how how specific is this assumption because like you know the more specific the assumption is on the one hand like the less the harder it will be to believe but on the other hand the more like uh research directions will be justified by the assumption. Yeah, that's a great point. So, I would say it's quite a loose assumption. I think we have in mind here broadly learning and search does cover an extraordinarily broad suite of things. Evolutionary algorithms in some sense also would fit into those categories. Okay. And uh so it's it's useful I think for guiding things but um yeah to your point about this trade-off between specificity and how much it unlocks versus how risky it is as an assumption I would I would view this as among the looser ones that we're we're leaning on. Sure. Maybe so to to pick up um something you said a little bit earlier um yeah in terms of like what wouldn't this cover? So, it sounded like uh brain uploading like if we if we did AI by uploading human brains, it sounds like that would not be covered by this assumption. Mhm. Is there anything like more similar to the current paradigm or even potentially like more likely to happen by your judgment that um would still count as breaking this assumption? Um yeah, it's a good question. And I think I'm mainly thinking in terms of these properties of uh leveraging increased computation and uh making heavy use of the R&D effort that is currently underway. I think if it were to be the case that highly logical systems perhaps akin to expert systems could be constructed in a way that was not leveraging learning in a way that is close to how it is done currently and not leveraging search. It is quite difficult though for me to to come up with good examples. Okay, so potentially like some sort of like like if we had a super-rational Bayesian expected utility maximizer that was like computationally limited but got better when you added more computation. It sounds like that would potentially be um that would potentially count as the kind of thing that would not break this assumption that like maybe you would put some work into. Uh so that's a good point. I think that would require pretty heavy revisiting of some of our components. So to to give some examples, um I think we are quite tied to core concepts from machine learning when we conceptualize how we're tackling the alignment problem.
So uh later in the document there's a description of how we're trying to get good learning signal through these mechanisms like amplified oversight. Yeah. And um you know implicitly that's making an assumption about how the model is going to be trained and uh it is plausible that that also fits into some of the more Bayesian frameworks that you're describing. It's not immediately uh clear the jump to me but if it's a Bayesian reinforcement learner, right? You could imagine like um you know there's uncertainty over some like underlying reward signal and like different amplified oversight activities provide more or less information about the reward. Like I think a lot of these things and and in fact a lot of the amplified oversight work I think was like conceived of in a or like if you think of like um work from CHAI on cooperative inverse reinforcement learning right it's like conceived of in this like very Bayesian way and like a lot of oversight work you can think of in this in this sense yeah I suppose it depends what how we're using the term Bayesian if it's effective propagation of uncertainty uh yeah that's I would fully agree that that's on board with with this um I suppose is um yeah I'm not sure that I have a particularly clear alternative as a way to frame that. Okay. Um and so so okay now now that we've got a decent like understanding of what the what the assumption is. So my understanding of the argument of the assumption is for the assumption is something like uh this you know the the current paradigm it's been working for a while. It doesn't show any signs of stopping. Um, and there's no obvious other paradigm that seems like it's going to uh swoop in and do something different. Um, and all these are decent arguments, but so so importantly I I think um near the start of the paper it says that this is basically a planning document up to the year 2030 and after that you know to some degree all bets are off. Um, and the arguments are roughly saying like, okay, you know, for the next 5 years-ish, we should expect the current paradigm to hold. I'm wondering like it would probably not be reasonable to assume that the current paradigm will hold for the next thousand years. So all of these arguments must have like some implicit time scale. I'm wondering like if you project out past 2030 like what is the time scale at which these arguments start breaking? Is it more like 10 years or is it more like 50 years? That is a great question. So maybe I should just clarify that um 2030 I I should double check what the framing what the phrasing is precisely but I think that's given as an illustrative date of um and I do think it is useful as a reference point but I don't think that the the plan is anchored specifically around that as a date. Uh do please feel free to correct me if I so so in the discussion in the introduction of no um in the discussion yeah when when you're talking about this assumption you say uh for the purposes of this argument we consider evidence relating to a 5-year future time horizon we do so partly as a matter of feasibility this is a time horizon over which we can make reasonably informed estimates about key variables that we believe drive progress with the current paradigm um though we anticipate that the current paradigm will continue beyond this we also do partly as a matter of pragmatism planning over significantly longer horizons is challenging in a rapidly developing R&D environment.
As such, we anticipate that some revisions to our assumptions, beliefs, and approach may be appropriate in the future. Okay, then I retract my previous objection, that is very explicit. Uh yeah so with regards to that that date part of the rationale for using it was looking at the previous historical trend of how things have developed and then trying to make arguments about where we can expect the inputs to continue and 2030 is a cute date partly because uh there was this nice study by Epoch that was trying to do a relatively fine-grained analysis of which of the inputs currently used by the current paradigm could plausibly continue up to 2030 and then on the basis of some blend of Fermi estimates and analysis they came to the conclusion that that was highly feasible and for that that's part of the motivation here. Okay. So, so it sounds like the arguments are basically like look as long as long as we can continue scaling up the inputs to this process and maybe like I don't know I can imagine some argument that says like look um maybe there's some Laplace's law thing where you expect to keep going about as long as you've been going so far. Um like and and oh is this like Lindy effect? I don't I don't know the Laplace thing that's new for me. Oh, um I'm just imagining like sorry I'm I might be wrong here but uh so Laplace's law of succession right like if you're suppose we're saying like how many years has it been since the since the current paradigm started and then we imagine there's like some underlying probability of uh the current paradigm switching and we don't know what that probability is and we say like well you know we imagine like we've well we we've observed the paradigm not fall for you know uh seven years or or maybe uh you want to give it 10 or 12 years or something. Um and then you can say okay well roughly like the rate of it the the the rate of the current paradigm failing per year has got to be like a little bit uh around 1 in 12 right if like we've seen 12 instances of it not failing and zero instances of it failing. I see. I had not encountered that terminology. That's a a useful one to know. Um yeah. So I think uh yeah maybe maybe to recurse back to your original question which as I understood it was um yeah do we expect it to to last far beyond that or why have you chosen the date perhaps it was both of those questions. Yeah maybe it was like in order to understand like whether the arguments really work for 2030 like uh when when would they actually stop working? I see. So I mean part of the mindset of this approach is um to give ourselves moderate time planning horizons and uh it is just uh highly likely that we would execute a replan over over that time scale. So based on current trajectory this seems like a a reasonable a reasonable future to to plan over but it's not a load-bearing assumption about what is likely to happen after that. Okay, with regards to specifically the scaling, I think um well, it remains to be seen. Perhaps one of the most notable inputs is training computation and Epoch has been tracking that quite carefully as I understand it. And I think we are very much above the trends that they initially projected in the first study as based on say like the Grok recent uh training run reported at least in their their public database. Um so that seems well at at the time of writing this particular section which was late last year. Gotcha. So so so actually yeah that's an interesting question. So, how how long have you guys been working on this?
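Editor's aside: for readers who want the arithmetic behind the interviewer's informal "about 1 in 12" figure above, here is a minimal worked version of Laplace's rule of succession under the hypothetical numbers used in the conversation (roughly 12 years of the current paradigm observed, zero paradigm shifts); the exact rule gives a slightly smaller number than the rough estimate quoted.

```latex
% Laplace's rule of succession: after observing s events in n trials, under a
% uniform prior the posterior predictive probability of an event on the next
% trial is (s + 1) / (n + 2).
P(\text{paradigm shift next year}) = \frac{s + 1}{n + 2}
% With the conversation's hypothetical figures, s = 0 shifts in n = 12 years:
P(\text{paradigm shift next year}) = \frac{0 + 1}{12 + 2} = \frac{1}{14} \approx 7\%
% i.e. the same order of magnitude as the "roughly 1 in 12" quoted informally.
```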
Like late last year. It sounds like this has been like uh like quite a long time coming. Uh well, I think there's a continuous process of um assessing the research lands landscape. Excuse me. I'll say that again. Assessing the research landscape and uh trying to integrate new developments into a cohesive plan. And there's there's always a degree of replanning that happens. And uh as as for why specifically this date, I'm not sure that there was I don't think I have a good answer to to why um like at what date the document was originally planned. Uh yeah, I don't have a good answer to that unfortunately. Okay, fair enough. Um cool. So I I think I'm probably ready to move to the um second assumption. Um unless there's more things you want to say about uh the paradigm continuation. Uh no I think I think good to move on. Okay. So um the second one is that there is no human ceiling. Um and so my understanding is that this is basically saying um AIs can be smarter, more capable than humans. Is that basically right? That is basically right. Okay. Um and actually maybe this is um this can be a jumping-off point to talk about just the level of AGI that you talk about in the paper. So you mentioned that um basically uh you're going to be talking about this uh level of exceptional AGI which comes from this paper that sets out levels of AGI and it says it's like the level of the 99th percentile of skilled adults on a broad range of tasks. And I was kind of confused by this definition. Um, and maybe it just depends on like what skilled means, but like I think for for most tasks or most domains, it's like pretty easy to be better than 99% of people just by like trying a little bit. For instance, like if you uh well, okay, I suppose you have to pick a language that fewer than 1% of people speak, but like if you learn 10 words in that language, you're now better than 99% of people at that language. If you like learn the the super basics of juggling, you're better than 99% of people at juggling. Like it's probably it's probably not that hard to be a 99th percentile surgeon, right? Um but but but maybe this word skilled is doing a lot like like can can you help me understand what's going on here? The percentiles are in reference to a sample of adults who possess the relevant skill. So in the Levels of AGI paper, the authors give as an example that performance on a task such as English writing ability would only be measured against the set of adults who are literate and fluent in English. Okay, it's uh not a completely self-contained definition because it is still necessary to determine what it means for an adult to possess the relevant skill. In the juggling example, I'd define that to be the group of people who could juggle. Yeah, perhaps with three balls. Fair enough. So perhaps the perhaps the next thing to talk about is the uncertain timelines assumption. Um uh can you say roughly what that is? Sure. Yes. So the premise here is that many people uh have spent time thinking about plausible timelines over which AI could develop. And there is still um perhaps not a very strong consensus over what the most probable timeline for AI development will look like. Perhaps you've seen in the last few days this nice article from Daniel and collaborators on the AI 2027 project positing one plausible future scenario. uh many people who have been surveyed across different disciplines have very different opinions based on the evidence that's available currently about what is plausible.
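Editor's aside: a small illustrative sketch (hypothetical numbers, not from the Levels of AGI paper) of why the reference class in the "99th percentile of skilled adults" definition discussed just above matters, using the juggling example from the conversation.

```python
def percentile_of(score: float, population: list[float]) -> float:
    """Percentage of the reference population scoring at or below `score`."""
    return 100.0 * sum(s <= score for s in population) / len(population)

# Hypothetical skill scores: most adults cannot juggle at all (score 0);
# a small minority -- the "skilled adults" -- can, with varying proficiency.
all_adults = [0.0] * 990 + [20, 30, 40, 50, 55, 60, 70, 80, 90, 95]
skilled_adults = [s for s in all_adults if s > 0]

beginner = 25.0  # someone who has just learned the basics
print(percentile_of(beginner, all_adults))      # ~99th percentile of all adults
print(percentile_of(beginner, skilled_adults))  # only ~10th percentile of people who can juggle
```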
And so the assumption is roughly saying short timelines seem plausible and therefore we should try to adopt strategies that have a kind of anytime flavor to them that they could be put into practice at relatively short notice accepting that there is some uncertainty here. Okay. and and is the assumption so you mentioned that um part of the assumption is that short timelines seem plausible. Um I guess for it to be uncertain rather than certain of short timelines like maybe part of the assumption is also that like longer timelines also seem plausible is like is is that half of things like something that you're intending to like include? Um and if so, how does that play into the strategy? Yeah. So I think one aspect of the plan currently is that it's a there's still a relatively I mean this is a subjective statement but there is some diversity in the portfolio. There are a collection of different approaches and in the most accelerating worlds some of those options do not make that much sense. But uh we're still in a regime where because there is this uncertainty, some diversity on the portfolio makes sense. That's that's sort of roughly the trade-off we're making here. So I think the the interesting part of this assumption comes in in the interplay with the next assumption. So the next assumption is approximate continuity. Um and this one like I think I actually misunderstood it the first time I saw it written down. So can you tell us just what is the approximate continuity assumption? Yes. So this is the assumption that improvements in AI capability will be approximately or roughly smooth with respect to some of the key inputs to the process. And the kinds of inputs we're thinking about here are computation and R&D effort but not necessarily something like calendar time. Okay. So if I put this together with potential for accelerating improvement, what I get is that it is plausible that um there's this kind of uh you know increasing you know ever quickening cycle of improvement where um well maybe compute goes in um you know uh relatively continuously with calendar time but like R&D effort like increases and increases like quite quickly. Mhm. Um and improvement in capabilities is like pretty smooth with the amount of like R&D input um and the amount of compute but in real time plausibly R&D input increases like very very quickly and therefore capabilities increase very very quickly. Yes, that's right. And and the thing that confused me here is that in the approximate continuity like like my understanding of the consequence of that assumption is that you could you could have some sort of like iterative approach where you know you do empirical tests see how things are going and then like you know things will go like that for a little while because it's continuous but like if if things are going very fast in calendar time I would have imagined that it would be pretty hard to like like if I imagine trying to do an iterative approach. What I imagine is like I do some experiment, you know, it takes me a little some amount of time to do the experiment. Then I like think about the results for a little while and I'm like, okay, this means this. And then I like, you know, work on my um work on my uh my mitigation for any problems that, you know, or or I implement something to, you know, incorporate the things I learned from that experiment into what's happening. 
And as long as I as long as like I or another human, I'm the one doing that, I would think that that would be like pretty closely related to calendar time. But if like things are not necessarily like continuous in calendar time like then I'm confused how this approach is able to work. Yes. So one framing of this is that because and it it does rely on very careful measurement of the R&D capabilities of the models. Okay. So as calendar time shrinks the assumption here is that in the scenario you're just describing the R&D capabilities net is growing very significantly. Yeah. And so what corresponds to a delta will be perhaps very very short in calendar time. but nevertheless can still be tracked and the replanning and reassessment of risk needs to happen at shortening time scales. Okay. And and so if there was to be a mitigation or a pause or a stop, uh that is how it would be implemented. Okay. And I guess I'm like like yeah maybe the thing I'm confused by there is it seems like it might happen faster than we can like like it takes a while to consider things and to think about like how to do a mitigation and so is the thought like well this is feasible because like the AIs who are doing the R&D will be thinking about all of that or is the assumption like at this stage you know like all we're doing is going to be like keeping track of the mit you know we're we're like you know you're not you're not writing papers on like uh various types of optimizers anymore the AIS are doing that all you're doing is like thinking about how to react to changes in the R&D input Like yeah, I I guess I'm wondering just like what does it look like to actually implement this in a world where you're growing like capabilities are growing super super quick in calendar time but continuously in R&D effort. Yes. So uh the way that I've been thinking about it is there are measurements being made and a continuous assessment of safety buffers projected into the future. Okay. And as progress goes up, there's a sort of scanning horizon over which we think we can continuously perform the kinds of tests, mitigations, and checks that we think would be necessary to continue to the next stage. Okay. And those would become closer and closer in calendar time. Yeah. And if we hit some component of a system, some quantum, some setting that meant that it was not safe to continue on the basis of the shortening time scales, uh then the system would have to stop. Okay? It's more that that's not um you know like a foundational axiom of the plan that would just be downstream of the fact that a mitigation was not appropriate for a certain time scale. Okay. But in principle it it's not as a consequence of the shortening time scale itself. Though it may in practice be the case that that is a limiting factor because we're not able to operate a system that we feel comfortable with. Okay. So, so the thought is something like um at any given point in time you'll have some like safety measures or whatever and you you can see that they work pretty well and you can see that they're going to work um for the next uh let's say doubling of R&D input and then like once you've like 1.3xed R&D input you like figure out some new safety mitigations that will like you know bring you further past that and then you know at this stage you figure out mitigations that will happen further further past that. Um if so it's uh does am I understanding you correctly this far? Yeah, that that is that part is correct. Yeah. Okay. 
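Editor's aside: a toy numerical sketch (an illustration of the dynamic being described here, not anything from the paper) of how capability can be smooth in cumulative R&D effort while equal-sized capability milestones still arrive at shrinking calendar intervals once AI systems contribute to the R&D.

```python
def milestone_times(steps: int = 50_000, dt: float = 1e-4) -> list[float]:
    """Calendar times at which capability crosses successive integer milestones."""
    capability, effort, t = 1.0, 0.0, 0.0
    next_milestone, crossings = 2.0, []
    for _ in range(steps):
        effort_rate = capability        # assumption: R&D throughput scales with current capability
        effort += effort_rate * dt      # cumulative R&D effort accumulates over calendar time
        capability = 1.0 + effort       # assumption: capability is smooth (here linear) in effort
        t += dt
        if capability >= next_milestone:
            crossings.append(t)
            next_milestone += 1.0
    return crossings

times = milestone_times()
# Equal capability increments are separated by less and less calendar time,
# which is why the replanning / safety-buffer cadence has to keep shortening.
for i in range(1, 8):
    print(f"milestone {i + 1} -> {i + 2}: {times[i] - times[i - 1]:.3f} time units apart")
```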
But in that case it seems like the well I guess this seems like we're going to have to like really be leaning a lot on the AI R&D workforce to do a lot of the work of coming up with these like new safety mitigations and stuff. Um if like I'm having like like suppose these milestones are coming up every every like three days for me. Um you know may maybe you can think of safety maybe DeepMind just has all these people who can like think of all the necessary safety mitigations in 3 days but then like then it speeds up to it's every like one and a half days and it's it's too fast for even the Google DeepMind people. Um so so am I right that like dealing with this seems like a lot of the work is going to have to be outsourced to these AI um AI researchers in the regime in which things are moving quickly. Yes, that is that is a fairly foundational component. Okay. And the risk and most likely one of the things that will cause a risk assessment that says things need to pause or halt are the complexity of establishing those schemes. Okay. So if it is the case that we cannot get to a sufficient level of confidence that the scheme can continue that is the kind of thing that would stop progress. I think this helps me get a better sense of like what's being assumed here and like that the actual work that this assumption is doing and and also like the limitations of it. Maybe this gets to um perhaps like a thing that struck me as different um between this plan and or a thing which I thought was different between this plan and some others that I've seen. So if I look at um so I don't think this is like Anthropic's official safety plan but there is this uh blog post called The Checklist that Sam Bowman wrote that I think was like yes relatively influential and it is basically framed around we should create um we should you know automate AI research and development in particular like we want to automate safety work and like all of our work right now is to figure out how to automate AI safety work and at the start of An Approach to Technical AGI Safety and Security. One thing it says is our approach in this safe in this paper is not primarily targeted at automating AI safety research and development. I'm I'm compressing that quote a little bit, but I think that and I don't know hopefully that's a fair characterization of it. Um, and on the one on the one hand I was going to ask like okay well why why is there this difference? But it sounds like if I combine the potential for accelerating improvement and approximate continuity, it sounds like this plan really is going to rely very heavily on automated AI safety research and development. So I guess I'm confused. Can you help me understand what's going on? Sure. Yeah, that's a great question. So I think um one framing of this is that that approach is implicit in our plan if it is the case that things if the trajectory rolls forwards in a certain way that is to say that if AI development does accelerate very quickly okay and if it was the case then our our plan moves closer and closer to that setting okay we're more you know in some sense it's a slightly more diversified portfolio currently that would collapse or concentrate according to how things develop. Okay. So, so when so when it said it was not primarily targeted at that goal, it sounds like how I should understand that is you were not assuming that you definitely will try and automate AI safety research and development as a thing.
But you also aim to make sure that you could do that in the world where that's possible which you regard or in the world where like um where you have these like accelerating um research AI research and development um which you think is plausible right that's that's correct and it's not that we would escape any of the as you are no doubt aware there are many significant challenges to be overcome to to implement uh that strategy. I think it's discussed briefly in the paper this idea of bootstrapping and the challenges of using one aligned assistant to align its successor. Um g given those difficulties it is highly plausible that progress is bottlenecked by an inability to make a strong safety case that progress can continue. Maybe we should move on to just approaches to mitigating risks um described in the paper as opposed to uh the assumptions unless there's more that you want to say about the assumptions. Uh no, that sounds sounds good to me. Okay. So it seems to me that the um two types of risk that are or or perhaps types is a slightly wrong word but um the two things that the paper talks about the most are misuse and misalignment where misuse is roughly like somebody like directs a model to do a bad thing and the model does the bad thing. And misalignment is the model does a bad thing kind of knowing that it's bad but not because someone else got it to do the bad thing. Um, is that roughly right? Yeah, that's that's that's a good summary. Okay. So, I mean there's some slight nuances, but I think that's a that's a good high level. Oh, I'm I'm curious about the nuances actually cuz one thing I noticed is like if I think if I at one point in the paper misuse is described as a user gets an AI model to do a bad thing and misalignment is described as an AI model deliberately does a bad thing or knowingly does a bad thing and in that definition like misuse could also be misalignment right yes that is a good point uh the risks don't form a clean categorization they are neither exhaustive nor exclusive. They are not exclusive in the sense that you could have, for example, a misaligned AI system that recruits help from a malicious actor to exfiltrate its own model weights, which would then be a combination of misuse and misalignment. Yes, on the other hand, given the scoping of the paper, we don't cover all possible risks like AI suffering, for example. The main benefit of the risk areas is for organizing mitigation strategies since the types of solutions and mitigations needed tend to differ quite significantly depending on the source of the potential harm. So misuse involves focusing on human actors with mitigations like security, filtering harmful requests and so on. M while misalignment requires focusing on the AI's internal goals and learning process and involves better training objectives and provide oversight and so on. I think that's fair enough to say. Um so so that's one concern about um perhaps uh overincclusion of things uh or overincclusion of requests into the like inherent misuse bucket. Um perhaps another concern is underinclusion. So, um, one thing that I believe you mentioned in the paper is one example of a thing that could count as misuse or as misalignment is you have one AI asking another AI for information that helps the first AI do bad stuff and the first AI is misaligned and the second AI is misused. Um, and it strikes me that uh, so a it strikes me that a lot of discussion of misuse is like imagining things that are roughly human actors, right? 
Um, like like a guy or like a collection of people is going to make a nuclear weapon and we don't want that to happen uh because they're the wrong people to have nuclear weapons. Um, it does strike me that AIS, like the the information, the the requests that we don't want answers to other AIS could potentially be different from things we're imagining in the CBRN space. For instance, like how how do you evade certain um controls and stuff? Um, and and not only is it so so with with the previous answer, I think it's fair enough to say, look, it's not really a technical question that we're trying to address, but like what information it would be very dangerous to give another AI. It does strike me as more close to a technical question. So, I'm wondering like do you have thoughts on what you know what dangerous requests look like in the context of you know AI interacting with each other? Yeah, that is a great question. It's not um it's it is deferred in and left out of scope for this technical document, but it is something that people are thinking a lot about. I don't have a like a a great pre-baked answer other than to say it's something where as the capabilities continue to improve. I think that that threat landscape is becoming much more salient and I just expect there to be significantly more work going forwards, but it's not something that's in scope for the work here. Would would you say this um kind of uh falls under the regime of um sort of access control and monitoring monitoring um in the misalignment mitigation section? Um so I think to some degree there are components of that but you have described exactly you know the potential of one failure case of this scenario. uh the case in which harm is achieved in aggregate or uh risks are accumulated peace meal across many actors such that no individual actor perhaps you know across different AI developers uh we we're not explicitly handling that in this in this approach fair enough um so the so perhaps to to get back to sort of the more core misuse thing so you talk about um you know just you know doing threat models and evaluations um for specific mitigations that are safety post training capability suppression and monitoring um and also like access restrictions um which I think like makes a lot of sense in the light of you know which requests are dangerous depends on who who's making the request um but you also have this additional section which is like okay um you security in the sense of I believe like security of the weights of the model um and also societal readiness um are also aspects of the misuse um section of the misuse domain. I guess I think security of model weights is a thing probably a lot of people in the AI safety space have like heard about or thought about a little bit. Societal readiness seems like if anything like perhaps um under underrated or under thought about in these spaces and I'm wondering if you have thoughts um just about what that should look like and how that or how that looks especially from like a technical angle. Yeah. So I think one example that's a nice one to give the idea here relates to cyber security and I believe this is the one discussed in the paper where as AI's become more capable at cyber offense. One way to reduce the misuse risk is to contribute those capabilities to the hardening of many bits of societal infrastructure which currently well I'm not um you know well qualified to make an assessment on the overall risk state but vulnerabilities exist in many cases. Yeah. 
And that's an ongoing process of hardening. Yeah. And I believe a a previous guest on the podcast, Jason Gross, is thinking about this to some degree. Oh, great. Um, so, so is this mostly thinking about, so it sounds like this is mostly thinking about, okay, using existing AI in order to harden up bits of societal infrastructure, to make bits of societal infrastructure less um, vulnerable to things. Perhaps like using AI to if there are some way to use AI to um to make it easier to make vaccines for things or to make it easier to make uh things that stop you from being damaged by a chemical weapon. It sounds like that would also fall under this umbrella. That's that's the that's the key motivation. Yeah, fair enough. I'm wondering one thing that feels related in spirit although less technical is my understanding for one re so uh there there are various labs such as Google DeepMind, such as OpenAI, such as Anthropic that um you know kind of work to release models to the public and one reason is you know they do cool stuff and it's valuable to have them be released but I think another theory of change for this is just like it's useful for the public to know what AI capabilities actually are so that they know how worried they should be so that they know you know like what what things they should want to be done about it. Um it strikes me that in some ways this is like like if I if I think just colloquially of societal readiness for AGI it strikes me that like at the moment this is probably like the biggest thing driving societal readiness of AGI. I'm wondering like is this is kind of yeah does this count as like in scope for what you're thinking of as societal readiness? Oh that's a nice question. Um so it is certainly the case that I share your sentiment that that is one of the most effective ways to increase current readiness though there are clearly trade-offs here. Um I yeah I'd have to think a little more as to whether it was motivated from the same angle, but certainly I I think it it does have commonalities. I believe your phrase similar in spirit is a good way to characterize it. Fair enough. Um it's a little less explicit. I mean that you know there are so many there are so many other things going on there but uh perhaps a positive side effect. Fair fair enough. So, and then finally, um, with misuse, you mentioned that, okay, there's going to be like basically red and blue teams to stress test misuse mitigations and also safety cases to, you know, some sort of structured argument for why misuse is like unlikely or impossible. And then you can, you know, try and investigate the assumptions. I think this is I also look at kind of see this in the um in the misalignment section. You know, there's like these red red blue team exercises um red teaming assumptions um getting safety cases for alignment. Um I'm wondering like yeah, do do you think these are going to look like very similar or do you think they they look pretty different? Um, and if they look different like I don't know how did it come to be that the that the assurances for misuse and for misalignment look so similar structurally. Oh, that's a good question. I suppose with many of the cases in misuse as we're characterizing it, we have some experience and fairly concrete ideas of what the risk factors look like. Mhm. And I think that concreteness lends a lot of opportunities for the sorts of strategies that red teams can be expected to deploy.
There's a pretty clear idea as to who potential threat actors are, the kinds of strategies they might use, and in the case of the misalignment work because some of these threats and risks are they're not novel necessarily conceptually, but our experience with working with them is is relatively new. I do expect there to be some some dissimilarities based on that perspective. Fair enough. So So like to give some kind of concrete example um when thinking about misuse. Okay. Um yeah, know-your-customer-style checks are leveraging external history of a particular user in the outside world and using that as evidence about their intention and that kind of affordance is not going to be available in the misalignment setting in the mitigations we're setting and there I expect there to be many such cases that distinguish between them but at a at a broad level adversarially testing the robustness of the the system is a kind of a generically good thing to do. Yeah. Well, the know your customer thing. I mean, in some sense, this seems similar to like, you know, stuff like um access control for AI. Um, access control would be similar, I believe. Yeah. And in some sense, it's kind of similar to know your customer, right? Uh, well, there's two there's two things. One is access control and the second is the kinds of evidence you're accumulating about whether something can be trusted. Fair enough. Fair enough. Um but yeah, it does it does sort of remind me that there has been um some amount of stuff written about um infrastructure for AI agents that um you know comes sort of close to infrastructure we have for humans doing things that could potentially be dangerous. Um but yeah, it's it's fair enough to say that um for misuse, we're potentially thinking of things that are more precedented. Um, I wonder maybe is that a consequence of the assumption that we're only looking for the uh exceptional AGI? Like I I could imagine a world where AI gets good enough that humanity learns of some like weird dangerous things. So, so Nick, I believe there's some book where Nick Bostrom uses this example of like, well, we could potentially live in a world where if you took some sand and you put it in the microwave and you microwaved it for 5 minutes, you got this like highly dangerous explosive. Um, the vulnerable world. Yeah. This vulnerable world. And you could imagine that like maybe we develop AGI and at some point it teaches us of these vulnerabilities, you know, like we don't just have to worry about like nuclear weapons, we also have to worry about sand weapons, you know, or like uh some other thing that we haven't like thought about before. Um yeah, ice-nine is so scary. Okay. So, so ice-nine, um yeah, as as you mentioned in the paper, it's this uh story from it comes from the story by Kurt Vonnegut where um it's this like uh different version of water that's uh solid at temperatures below like 45° C and any normal water that touches ice-nine becomes solid and then like it just takes over the world like that. So that hasn't happened with water, but that really has happened like with certain chemicals in the world. Like there are drugs that don't work anymore because like basically this thing happened like more than one of them. It's it's like the it's one of the creepiest it's I don't know this fact just creeps me out so much. Um where was I? So is I think you were probing about novel. Well, there is this component of there may just be a lot of unknown unknowns. Yeah.
That are baked into the ecosystem that will be revealed as the models become more capable. Yeah. And I'm wondering like if if we think that like like if you're thinking of misuse as like okay there are basically like known dangers, is that a consequence of an assumption that we're talking about AI that is a little bit smart but not wildly superhuman? Um I would so the the comment on known dangers I think um I I perhaps would use that more as a reflection on the the maturity of those fields currently rather than maybe a fundamental distinction between them. Uh just because the relative capabilities of AIs and human threat actors are in the state that they are currently but the affordances of both I do expect to change over time. For example, risks that come from the fact that the AIs can absorb very large amounts of content concurrently or execute at extremely high speed will mean that plausibly there are risks that were not uh tractable in the case of human operatives that are now are now tractable. Yeah. I I mean it seems like it plays into the mitigation. So suppose you're doing like um like like the misuse uh mechanisms, right? There's like safety post-training, there's capability suppression, and there's monitoring. And it seems like those rely on knowing which things you have to like like knowing which things you have to post-train the model to not talk about, knowing which capabilities you've got to suppress, and knowing which things you've got to monitor for. Whereas like if AI is smart enough, they can discover like that it can learn about a new capability or it can learn about a new vulnerability in the in the world. um it lets some humans know about it and then like humans you know start exploiting it. If that happens before um developers are able to realize what the issue is, figure out what capabilities they should suppress, figure out like um you know what thing what questions they should get the model to not not answer, figure out what things they should monitor for. I I think in that world those misuse mitigations become weaker. And so it's it seems like it seems like there's there must be some assumption there unless I'm misunderstanding how general these tools are. No, that that is correct. There are there is explicit threat modeling that goes on to try to identify the kinds of misuse risks that we think should be prioritized. Um explicit thought about what capability levels pose risks for certain threat actors and then mitigations are implemented downstream of those. And so there is there needs to be a kind of continuous scanning of the horizon for new risks that may materialize. But okay, it is not the case that they are sort of baked in in some implicit way into the plan. Yeah. And and I suppose like one nice thing about that is that if you're a model developer and if you're worried about new vulnerabilities being found by AI, if you have the smart AI before anyone else does, then like maybe that helps you scan the horizon for vulnerabilities that you should care about. and you might hope that you'd be able to find them before other people do. Uh there's there's that it is you know these things are very they have these dynamics of a so you know a wicked problem they're very integ I think this is often described as one of the challenges of an open-source approach where if it was the case that such a vulnerability was discovered the inability to shut down access um you know that there's an additional challenge.
It may still be the case that the the trade-off is worthwhile under the collective risk judgments of society, but those that's a trade-off with the different approaches. Sure. Um so maybe we should talk a bit more about um kind of the misalignment um mitigations discussed in the paper. So at a high level there's I I take the misalignment mitigations to be okay try and make the model aligned try and control the model in the case that it's not aligned. Um do some miscellaneous things to make the things you've done work better and also you know get assurance of good alignment and good control. Um does that seem I think that's a good characterization. Okay. Yes. Cool. So for alignment there's amplified oversight, guiding model behavior, and robust training. Um, and this is like I found this kind of interesting in that it's a little bit different from what I think of as the standard breakdown of how to do alignment. So I think the standard breakdown I sort of conceive of as like do a thing that people usually call scalable oversight which I think is close enough to what you mean by amplified oversight and then you know deal with um this potential for inner misalignment where even though we haven't rewarded even though we haven't like reinforced AI for doing bad things it does bad things anyway because it wanted to do bad things at one point and then it decided to play nice for a while. Um, so just figure that out, you know, some somehow deal with that. Whereas like amplified oversight, guiding model behavior and robust training, it it seems like a bit of a different breakdown than what people normally talk about. So I guess the first question is amplified oversight, is that the same as the thing people talk about when they talk about scalable oversight or do you mean to like draw that boundary a little bit differently? Uh, so in both cases, debate is kind of a canonical method. One reason for the amplified oversight distinction is just that the term scalable oversight has been used for many things. Uh we use it somewhat similar to, say, Sam Bowman's paper uh describing using this terminology. There's a a technical definition in the paper which I believe is something like for a given input-output pair, achieving oversight that is as good as could be achieved by a human if they understood the reasons why an output was produced by an AI and they were given unlimited time to reflect on the decision. Sure. Maybe to comment on your your previous remark about this breakdown. I think these these do map still relatively closely to the distinctions you made if we use the terminology. So we often use this terminology of specification gaming, roughly outer alignment or outer misalignment if it's gaming, and um goal misgeneralization, sometimes the term inner misalignment is used for that, and the latter component of the plan that you mentioned, the robust training, is targeting goal misgeneralization, okay so to some degree there's an overlap there. Sure, so so so if robust training is targeting goal misgeneralization, what is the guiding model behavior thing doing? Guiding model behavior. Yeah. So, so the the core premise is uh let us suppose we have a good solution to amplified oversight. It is likely to be extraordinarily expensive as a mechanism for obtaining high-quality supervision. Okay. And therefore we want to get the most juice out of that we can. And one way to do that is with standard training techniques.
But there is a possibility that there will be other more more data efficient ways to do this. And so the the guiding model behavior is just trying to encapsulate you know how we're actually going to apply these these labels to the model. It could be these things like natural language critiques or if there are other mechanisms that will make use of the labels more efficiently. Okay. So so to help me understand um this a little bit better. So suppose I take um constitutional AI where roughly what happens is a human like writes a constitution. May maybe this is done in an amplified method where they like think about it really hard with AI help or something and then you know and then some like reward model looks at the constitution and looks at AI outputs and grades them. Would would that count as the kind of thing you're talking about in guiding model behavior or is that something else? Yeah. So the process of translating the constitution into the learned behavior of the model. H that's roughly what we're encapsulating there. Okay. And then to the degree that we felt that or that it was thought that somehow the constitution was underspecified then you would come into the regime closer to the robust training the selection of samples and active learning and mechanisms to make sure that you have good coverage. Fair enough. Um yeah I guess I'm wondering where the where the line is between guiding model behavior and robust training. um like like they they they have slightly different vibes, but I think of robust training as like training mechanisms to make sure the model does the thing and guide model behavior also sounds like training mechanisms to make sure the model does the thing. So like if I have like something like amplified sorry if I have like adversarial training like maybe that counts as robust training. If I'm like providing um if I'm trying to provide uh reinforcement to the chain of thought like I might hope that this makes the thing more robust but uh maybe it also is for guiding model behavior. Um in in real life I think probably that's a bad method the the thing I just said but like um yeah where do you see where do you see the line between these two things? I think the the key component is primarily just this emphasis on getting robust generalization. Okay. So to the degree that that comes for free from your training method, then you're you're good to go. Okay. But uh since we we often expect that we might need explicit approaches for for achieving that, that's roughly what we're trying to encapsulate in the in the robust training method. So So I guess maybe it's a distinction between research directions rather than between um techniques. So like the research direction of like providing oversight just anywhere you can that maybe that counts as guiding model behavior and the research direction of like making it robust as you can maybe that counts as robust training but like maybe there's a bunch of things that like could come out of either research direction. Uh yeah so I I mean so I may have misunderstood your point. I I think that to me there's there's still a relatively strong distinction. You know this first component get really good labels. M second component use those labels to train the model. Yeah. And the third part is really focus on making sure we have good generalization. And if that yeah I may just be repeating what you what you previously mentioned but to the degree that that is covered implicitly by your second part. 
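Editor's aside: a minimal structural sketch of the label-efficiency idea being discussed above (every function name here is a hypothetical placeholder, not DeepMind's implementation): a small budget of expensive amplified-oversight judgements is distilled into a cheap learned judge, which then supplies the bulk of the training signal that guides model behaviour.

```python
import random
from typing import Callable, List, Tuple

# Type alias for readability; every callable below is an assumed placeholder.
Judge = Callable[[str, str], float]          # (prompt, response) -> quality score

def distill_overseer(
    prompts: List[str],
    expensive_oversight: Judge,              # e.g. a debate / amplified-oversight protocol
    generate: Callable[[str], str],          # the current policy being trained
    fit_judge: Callable[[List[Tuple[str, str, float]]], Judge],
    label_budget: int = 100,
) -> Judge:
    """Spend a small budget of expensive labels; return a cheap judge for large-scale use."""
    sampled = random.sample(prompts, min(label_budget, len(prompts)))
    labelled = []
    for prompt in sampled:
        response = generate(prompt)
        labelled.append((prompt, response, expensive_oversight(prompt, response)))
    return fit_judge(labelled)

def guide_model_behaviour(prompts, generate, cheap_judge: Judge, policy_update):
    """Score many rollouts with the distilled judge and hand them to the policy update."""
    scored = [(p, generate(p), cheap_judge(p, generate(p))) for p in prompts]
    return policy_update(scored)
```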
You could fold them in, if that's a cleaner distinction for you. But the third part is just to say: this is an important part to focus on. Yeah. Okay, maybe I should just stop making noises. No, it's good if we can get it clear, because maybe I've misunderstood. Well, I think that with guiding model behavior, making sure that you're applying your labels in a smart way, in some sense the distinction seems to be: when you're coming up with techniques, are you thinking more about generalization, or are you thinking more about label efficiency? But you might use the same or very similar techniques in both. And you might be doing very similar things, which is relevant because, to the extent that you were thinking of the first one as the quote-unquote specification gaming one and the second one as the goal misgeneralization one, it seems like guiding model behavior could help with either specification gaming or goal misgeneralization or both, just depending on how you're doing it. That is fair, yes. Which is, to the degree that you think specification gaming versus goal misgeneralization is definitely the right way to carve up all problems, then that's going to give you one perspective; and if you think guiding model behavior is very different from robust training, then maybe you want a breakdown that is slightly different from that old breakdown, and that strikes me as kind of interesting. I see, so let me try to paraphrase and see if I've understood your point. Your point is: in the past many people have had two boxes, we have three boxes, and three is different from two. That's part of my point. And then part of my point is, when I look at guiding model behavior and when I look at robust training, they seem like they maybe blend into each other. It seems like they're both fundamentally about how to train things, what you do, and where you apply the reward signal. I think that is fair, yes. So you then talk about various methods that can basically make other mitigations for misalignment work better, and one of them is interpretability. Somewhere in the paper there's this interesting sentence that says interpretability research is still quite nascent and has not yet enabled safety-crucial applications, and the conclusion is therefore that more basic research is needed. You might think, people have been working on interpretability for a while; you might think that at some point, if it hasn't enabled any safety-crucial applications, we should stop doing it. So why is the thought 'more basic research is needed' rather than 'let's just give up'? Yeah, so a few things come to mind here. One is just about the relative effort that has been expended in the field. It is true that effort has gone into understanding neural networks, but as a total fraction of all effort, and I don't have a good sense of how to quantify it, it's not clear to me that we've exhausted the limits of what is possible by pushing more effort in. So I guess it really comes down to: what is our expected return on investment?
And there's a bit of a risk-reward calculation there. So part of the incentive is to think: well, big if true; if we did get these benefits, they'd be really big. There is some uncertainty, and maybe they're a slightly risky bet, but that in itself is part of the core justification. There's a second, slightly more pragmatic component, which is that in teams, of which I think our team is an example, there are a collection of individuals who have differences of research taste and different perspectives on what is promising, and we allow those also to inform the overall direction. It's a kind of combination of bottom-up and top-down, and so if people have clear visions and clear perspectives on how they think something has a tractable route to action, that's also an argument for going forward. There's one other point, but I can skip it for the sake of not talking too long on one topic. Well, I actually love talking too long on one topic. Okay. Perhaps it's a vice. Well, in that case: one thing I think quite a lot about is this idea of how things can act differently at different scales. I suppose this has now been widely studied; my first encounter with it was in the analysis of Hamming, looking at how in many fields, as the parameters of the field change, sometimes the science changes. For example, if you're in biology and you have a lens that allows you ten times greater magnification, you just start to see fundamentally new things. And in the field we're currently operating in, we're sort of blowing through many orders of magnitude on various axes. It may well be the case that the field is in some sense fundamentally new, or looking at new regimes and opportunities that were not there previously. That's the second reason why some uncertainty over what is possible also seems appropriate. Maybe to go back to some of the things I started with: I'm wondering how this whole process has shaped your thinking on the issue of technical AGI safety. For instance, has it made you feel more confident in the assumptions? Has it made you feel less confident? Has it changed your views on which research you're more excited about? Yeah, that's a great question. I think one of the primary consequences for me is that it encouraged me to look much more deeply into one of the specific scenarios, the ones we discussed related to the most aggressive acceleration, and to focus more of my own research effort around those scenarios, accepting that it's plausible they don't go ahead. But for some of the reasons we discussed earlier, these are sci-fi scenarios to think through and very challenging conceptually to reason about. And so perhaps the greatest update for me has been to look at the arguments in some detail about how plausible those sorts of feedback loops are, to upweight their importance at least in my own mind, and to spend more time on it. So if listeners want to think about this a little more, obviously there's the section in the paper talking about it, and you mentioned this work by Epoch looking at the returns to AI research and development. Is there anything else that you found especially useful for trying to think about what the scenario looks like and the likelihood of it? Yeah. I think some of the nicest writeups of this are the work recently put out by Forethought.
This would be Tom Davidson, Will MacAskill, and I believe some other authors; I can't recall them off the top of my head. That work has tried to analyze questions like: what is the plausibility of an intelligence explosion? What kind of dynamics are likely to play out? They do these taxonomies looking at, well, what if it was to happen only in software? What if that then progressed into chip design, and then later into hardware, ultimately leading to an industrial explosion? What kinds of timelines are plausible? There's lots of nice analysis that's been put out on those questions, and then you can go in and critique it for yourself. One thing that I've tried to do is connect it back to some of the more recent work, and I think METR has done a fantastic job of this: conducting evaluations of current systems, trying to get high-quality evidence about where we are today and what kind of trend line we're on, and then trying to bring these two things together into the same picture. Aiming for that kind of synthesis is one of the things I've been thinking about a lot. Yeah, that makes a lot of sense. Any preliminary results from trying to do that synthesis? So I'm a big fan of the recent work from METR on the task horizons of AI agents at the frontier, and I've been trying to grapple with: do I think these are representative? Do I think this is roughly how progress is going to go? And just the process of trying to operationalize these claims, which are very vague and somehow based on vibes in many discussions about whether progress is fast: 'Well, I use this chatbot and it did this thing for me, and I have these three test cases and two of them never worked before but now suddenly it works.' I really like these efforts to formalize things. I also think they highlight some of the real methodological challenges of doing good work here, and to their credit they're very precise in documenting all of the nuances involved. Just to give one concrete example, I think there's quite an important distinction between what they describe in the paper as low-context tasks and high-context tasks. For the sake of making comparable benchmarks, they use low-context tasks; these are roughly tasks that don't require a lot of onboarding. But onboarding as a phenomenon, I personally think, though this could be falsified with time, may be a key advantage for the models over humans in many regimes. And so if we do not account for that when estimating task durations, that's something that could skew the time horizons in one direction. There are many other things going in other directions, but there are many details you have to get into to do this kind of analysis, and I think they've done a great job of producing some of the first work here that is pretty rigorous. Sure. In terms of onboarding being a key advantage that AIs could have, is that just because, if you have a language model, it's just read all of the internet, and so it knows more background information than any given human does? A lot of it, in my opinion, is to do with bandwidth. As a human executing a task, we tend to spend some time learning on the task.
Let's take a particular coding project: we sort of amortize the time spent getting familiar with a codebase, or learning about the tools and technologies that we require, across the subsequent tasks that are relevant to it. Whereas the model operates more in a regime where it may be able to perform all of that onboarding close to concurrently, with a very large context window absorbing much of the relevant information; but so far it has not been the case that that information has been directly made available to the models. So there may be something of a context overhang here, where if you think about how you as a human execute a complex task, when you're doing onboarding you access lots of kinds of information that we're not currently passing to the models, and it may be the case that as that information becomes available, their ability to execute some of these tasks goes up. It's not clear that this will absolutely be the case, but it's an example of a nuance that you get into once you really try to operationalize these things, and it could have quite big consequences for the projected timelines. Fair enough. And so you mentioned that thinking about this shaped what kinds of work you thought you could do that would be relevant to this scenario. What did you end up thinking of? So I've spent time thinking about a few directions. One is learning more about model weight security. It's plausible that that will become quite important in worlds in which capabilities grow quickly, and a sort of cursory knowledge is somewhat insufficient for making good judgments about what is likely to happen and how things will play out. A second thing I've been thinking a lot about is tools that can improve decision-making, particularly for people who will be in positions of allocating resources. If we are in these regimes where calendar time shrinks, we want to have done a really good job of supporting them and of setting up platforms and ways of processing information that are succinct, high signal-to-noise, and also robust to misalignment threats. Yeah, that seems right. I guess another thing I've been thinking about, and maybe this doesn't quite count as a technical approach to misuse or misalignment, but to the extent that some of the assumptions are that it is plausible we have very short timelines and plausible we have accelerating improvement, probably one of the most relevant things to do is just to check whether that's true or not, to get as many leading indicators as we can; and off the top of my head I don't actually know if this is discussed in the paper. It's not something we go into in much detail in this paper. It is something I've given some thought to, but it is a very difficult question. There are sort of two questions here: is it likely, and how likely? And then there's a second question of when, and in some sense it's easier to get evidence about the second if you have a model or some smoothness assumptions about how things are going to go. But on the plausibility question, there are very interesting discussions; I will just refer readers to the Forethought writeups on their assessments of various factors affecting plausibility. Fair enough, but I agree it is a very important question.
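To make the synthesis discussed above concrete, here is a back-of-the-envelope sketch combining a simple horizon extrapolation with the onboarding accounting just described. All figures are invented for illustration; METR's publications contain the actual measurements, and neither function reflects their methodology.

```python
def extrapolate_horizon(current_horizon_minutes, doubling_time_months, months_ahead):
    """Project a task horizon forward assuming a constant doubling time."""
    return current_horizon_minutes * 2 ** (months_ahead / doubling_time_months)

def human_minutes_per_task(onboarding_minutes, task_minutes, tasks_amortised_over):
    """Human cost per task once a one-off onboarding cost is spread over related tasks."""
    return task_minutes + onboarding_minutes / tasks_amortised_over

# Invented figures: a 60-minute horizon today, doubling every 7 months.
print(extrapolate_horizon(60, 7, months_ahead=24) / 60)  # ~10.8 hours, two years out

# Invented figures: a 30-minute task with 10 hours of codebase onboarding,
# amortised over 50 related tasks, versus the same task scored "low-context".
print(human_minutes_per_task(600, 30, 50))  # 42.0 minutes per task
print(human_minutes_per_task(0, 30, 1))     # 30.0 minutes per task
```

Whether the human baseline includes the amortised onboarding share, and whether the model is given the corresponding context, can push measured time horizons in either direction; that is the kind of methodological detail the discussion above is pointing at.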
So, okay, we're probably going to wrap up soon. I'm wondering, is there anything you wish I had asked that I have not yet? I don't believe so. Okay, fair enough. Not one that I can come up with quickly. Okay. Well, I guess to conclude, if people are interested in your research and they want to follow it, how should they go about doing that? I have a profile on X; my username is Samuel Albanie. Okay, no underscores, no dots. So, Samuel Albanie on X: that's the primary place where people should follow your work? I think that's a reasonable strategy. Okay. Well, thank you very much for coming on and chatting with me. Thanks so much for taking the time, I appreciate it. This episode is edited by Kate Brunaut, and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR Labs. Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read transcripts, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, if you have any feedback about the podcast, you can fill out a super short survey at axrp.fyi.

Related conversations

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): Med 0 · avg -0 · 108 segs

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): Med 0 · avg -5 · 133 segs

AXRP

15 Jun 2025

David Lindner on Myopic Optimization with Non-myopic Approval

This conversation examines core safety through David Lindner on Myopic Optimization with Non-myopic Approval, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): Med 0 · avg -2 · 113 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): Med -6 · avg -7 · 120 segs

Counterbalance on this topic

Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons


Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): Med 0 · avg -0 · 108 segs

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups


Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): Med 0 · avg -5 · 133 segs

Mirror pick 3

AXRP

15 Jun 2025

David Lindner on Myopic Optimization with Non-myopic Approval


Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): Med 0 · avg -2 · 113 segs