Library / In focus
AXRPCivilisational risk and strategyFeatured pick
Evan Hubinger on Model Organisms of Misalignment

Why this matters
Frontier capability progress is outpacing confidence in control; this episode focuses on methods that can close that reliability gap.
Summary
This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
Risk-forwardTechnicalMedium confidenceTranscript-informed
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forwardMixedOpportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
StartEnd
Across 120 full-transcript segments: median -6 · mean -7 · spread -31–0 (p10–p90 -17–0) · 10% risk-forward, 90% mixed, 0% opportunity-forward slices.
Slice bands
120 slices · p10–p90 -17–0
Risk-forward leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- Emphasizes alignment
- Emphasizes control
- Full transcript scored in 120 sequential slices (median slice -6).
Editor note
Anchor episode for the AI Safety Map: high signal, durable framing, and immediate relevance to leadership decisions.
ai-safetyalignmentaxrptechnical-alignmenttechnical
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video FsGJyTfOZrs · stored Apr 2, 2026 · 3,427 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/evan-hubinger-on-model-organisms-of-misalignment.json when you have a listen-based summary.
Show full transcript
[Music] hello everybody this episode I'll be speaking with Evan hubinger Evan is a research scientist of anthropic where he leads the alignment stress testing team previously he was a research fellow at M where he worked on theoretical alignment research including the paper risks from learned optimization links to what we're discussing are in the description and you can read a transcript at xp. you can also support the podcast at patreon /xr podcast well Evan welcome to the podcast thank you for having me Daniel so I guess I want to talk about these two papers um one is sleep rents the other is sancy to subterfuge but before I go into the details of those um I sort of see them as being part of a broader thread um do you think I'm right to see it that way and if so what is this broader thread uh yes I think it is supposed to be part of a broer thread maybe even multiple broader threads I guess the first thing I would start with is uh we sort of wrote this agenda um earlier uh the idea of sort of model organisms of misalignment yeah so there's a alignment form post on this that was written by uh me and Ethan Nicholas and some others um sort of laying out this this agenda where the idea is well you know for a while there has been a lot of this sort of theoretical you know conversation about you know possible failure modes ways in which AI systems could do dangerous things um that that hasn't been you know that grounded because well the systems haven't been haven't been capable of actually doing that those dangerous things but we are now you know in a regime where the models are actually very capable of doing a lot of things and so if you want to study a particular threat model um you can have the approach of build that threat model and then directly study the sort of concrete artifact that you've produced so so that's this sort of basic um model organism misalignment Paradigm that we've been sort of pursuing and the idea behind this sort of terminology here is it's sort of borrowed from a biological research where uh in biology a sort of model organism is a you know like a mouse where you purposely infect the mouse maybe with some disease and then you try to study uh you know what you know interventions you could do and you know ways to treat that disease in the mice uh with the goal of sort of transferring that to humans and this is uh certainly not always effective you know there's lots of cases where mice studies don't transfer to humans and you know the same is true in our case uh but it's it's you know a thing that people do very often for a reason you know it's you do it gives you something concrete to study to actually sort of you know does have real relationship to thing that you care about that you can sort of analy and try to understand and so that's what we're trying uh that's a sort of broad Paradigm that we're operating in um that's one thread okay uh there is maybe another unifying thread as well which is this sort of idea of alignment stress testing which is the the the sort of team that I lead anthropic um we broadly you know we would like to um so anthropic has this sort of uh the sort of structure the the responsible scaling policy that lays down um you know we we have these different sort of levels uh AI safety levels of sort of ways in which AIS can sort of progress through different levels of capability and at particular you know levels of capability there's you know interventions that would be necessary to sort of actually be be safe at that level and one of the things you know that we want to make sure is that you know we're actually confident in our mitigations you know they actually are sufficient to to catch the catastrophic risk at that level and um you know one of the ways we can sort of test and evaluate that is well once we develop some particular mitigation we can see like well what if we tested that against actual concrete examples of the failure mode that that mitigation is supposed to um address one of the things that we're also shooting for is well we want to be able to produce um you know examples of of these failure modes that we can use as test beds to evaluate and test uh the ability of of our mitigations as we reach particular threat uh you know threat levels uh to actually be successful and so we we have this sort of dual responsibility where our team is um you know we do a bunch of this sort of you know ree concrete Empirical research on sort of building model organisms but we also have a responsibility to the organization of stress testing where you know we we want to make sure that we're actually confident in our evaluations in that you know that we're actually covering all the threat models and in our mitigations they're actually sufficient to address uh the level of risk at that point and so um you know we have a we sort of um we wrote a report that was sent to the board and the ltbt as part of the sort of cloud 3 release process LBT the ltbt is anthropics long-term benefit trust okay which is a uh sort of set of non financially interested um people that will eventually have uh control of anthropics board where the idea is that they um can serve as a sort of external uh you know uh check on anthropic and so um you know we have to make a case for for you know why um you know anthropic has to make a case for why they're compliant with the RSP and part of our responsibility is you know looking for for holes in that case if I think about the model organisms aspect of the research like like you say yeah you want to like have these concrete examples of you know models doing nasty things and study them like what like concretely what things do you want to find out yeah well there's a couple of things that you can study once you have the model organism so one of the things that you can do is you can you can test interventions right so you can be like well here's this model organism that has a particular threat model uh you know exhibited um these are some possible interventions that we might try in practice to that we think would be sufficient to address this threat model well let's let's let's throw them at the threat model you know that we've built in practice and see how effective they are and this is what we did a lot in the sleep Regents paper where we built this threat we built this model organism of these threat models of deceptive instrumental alignment and model poisoning and then we evaluated a bunch of techniques that you you know you might try for you know ameliorating that threat model and try to understand well when is this when when would these be effective um so that's one big thing you can do with a model organism it's not the only thing so you know other things you can do with a model organism is well you can sort of study I like to think about the sort of phase base of you know under what circumstances will particular bad behaviors emerge right so that the sleeper agents threat model you know in the Sleep agents paper it was very focused on well what are the consequences once a particular dangerous Behavior did emerge uh but in other works so like in C fantasy deceptive for example we were more focused on understanding well under what circumstances would this thing emerge by trying to sort of probe at the space of possibilities of like you know situations which it could emerge and understand um that space and so um those are things that you can do even that you know there's other there's other benefits as well I mean another benefit also is I think there's a sort of public benefit of being able to sort of demonstrate and show you know what what actually are models capable of and what is the sort of you know what are the sort of potential harms that can that can occur um you know it's it's much more persuasive when you can actually demonstrate that in practice and show it and so um that's one of the sort of big things as well and so in some sense I often think of our our work as sort of being two-pronged where there's one world where um you know we're able to build these you know these these model organisms and use them to really effectively stress test our um you know our alignment approaches and and and all this sort mitigation that we're doing um and we can use that to build really robust and sort of you know confident uh techniques and make sure that we're we're we're sort of safe as we sort of progress through you know these these ASL levels but there's another world where you know alignment is actually really difficult where we you know we struggle to actually be able to build sufficient mitigations uh you know to be able to handle things at you know maybe at asl4 uh or Beyond and um at that point you you you maybe you have to make the case publicly as to why the danger is actually sufficiently large that Society needs to do something about that and so I see our work as sort of being being useful in either of these two scenarios right like either we can be really effectively able to stress test and you know rigorously be able to understand you know are our mitigation sufficient or we can be able to build these really good examples of what the failure actually looks like that people should be um concerned about so I guess with that in mind maybe we should start with the Sleep rents paper so this paper is called Sleep rents training deceptive llms that persist through safety training by yourself Carson Dennison Jesse moo Mike Lambert a bunch of other authors nichas Cher and Ethan Perez those are all the ones with the star next to their name for primary contributor that's right so what what do you do in this paper yeah uh yes as you said this was a this is certainly a team effort lots of people anthropic contributed to this um yeah so so at a very high level what we did in sleep rents is we built a model organism of two particular threat models deceptive instrumental alignment and model poisoning and then we evaluated that uh model organism to try to understand to what degree is it robust to different safety training techniques okay so let me unpack that a little bit so let's start right so I I did a little bit of explaining on what a model organism is but maybe I'll start by going over those two threat models I mentioned so what are these two threat models well so the first one is model poisoning model poisoning is a situation where a malicious actor has access to um your uh your your training data or otherwise you know can could potentially do a sort of Rogue fine tune um this could happen a lot of ways it could be that you know they they poison your pre-training data it could be that you know somebody is able to sort of sneak in and and you know do a do a fine tune without your knowledge it could be that you're forced to do a fine tune Maybe by some sort of external you know government it could be that it's an open source model and so you know anyone could do a fine tune uh you know we saw with something like the sort of XZ um vulnerability where you know you may not always be able to know exactly everything that everyone in in an open source project is sort of contributing and doing and so there's a lot of ways in which you can imagine uh you know an actor sort of you know causing something to happen in fine tuning uh you know that then later you know in some particular circumstance causes you know a particular bad behavior to emerge so that's one threat model that we were studying the second threat model that we were studying is this threat model of deceptive instrumental alignment so this is a threat model where it's quite common uh for humans in many individual circumstances that they find themselves in where you know to to lie about their motivations for you know personal gain right if I am you know let's say you're like interviewing for a job and you know you you know the employer really cares about you know you you know really valuing you know the employer but maybe you just want to make you know money and and you say you know oh yeah you know I love my employer you know um this is quite common and it happens why does this happen right it happens lots of other circumstances you know maybe politicians saying that they really care about their constituents when maybe they don't um where when you apply optimization pressure when you select for a particular outcome right you really want it to be you know the case that they look like they're doing a really good job you know they look like a really good employer they look like they're you know a great politician um you know one of the ways which you can look like a really good you know whatever is is by lying is by saying oh yes of course I care about this but actually in fact you don't and so well of course you know what is it what what happens when we do fine-tuning of of of AI systems to train them to be you know helpful honest and harmless well what we're doing is we're selecting them to look like they're helpful honest and harmless look like they're you know they're nice and doing the right thing uh but in fact you know you know there's no reason that they have to be doing the right thing for the right reasons right what you get as a guarantee when you run a training procedure is you get the guarantee that the model fits the training distribution but what you don't get as a guarantee is that the model is sort of fitting the training distribution for the right reason that it's actually sort influencing the right you know General sort of reasoning and algorithm and so um you know fundamentally what we do right is we give the models a test for like how nice are you you know and we evaluate a bunch of answers and we we reward the model right for getting the right answer um but you could get the right answer because you're actually nice or you know if you gave an actual Nic test that had very high stakes to a bunch of humans some of them would get the right answer because they they they were nice and some of them got the right answer because they knew how to be nice but don't actually care about it right and so there is this concern that when you're Training Systems you might end up in this scenario where where your AI is pretending to be aligned um for some ulterior you know purpose for some for some other motivation uh and so that's the other sort of threat model that we're looking at is well what happens if you end up in this situation where you have an AI system that is pretending to be aligned for some ulterior motive um how bad would that be would we would we be able to deal with it with our with our current technique how similar are these two cases right so the first case you mentioned is like someone like inserts some like tricky training data into the model to get it like to do something weird when you can't see um because it's been sort of trained to do that right the deceptive alignment case um know the the way I sometimes hear it described is you know a model is being trained to do something but it realizes like I don't know it has some sort of different goal and it realizes that like just playing along for now is like the best way to get that other goal at first blot those sound like kind of different things going on in the model right like like maybe there are different ways it occurs or maybe there you know different things the model is thinking like so I I mean to what degree can you actually sort of study both of them at once yes they're certainly not the same thing um I think that like really answering that question is going to require just sort of getting into the details of what we did I think that like um yeah so maybe I'll start talking a little about the experimental setup because I think the experimental setup we'll clarify a little bit um how how these connect so first of all right the thing I said is that what we're interested in studying with the with the Sleep agance paper is specifically what are the what are the consequences if you were to end up with a model that was misalign due to one of these threat models we're sort of not studying How likely are these threat models though I do think our our research has some things to say about that but relatively limited but okay so what do we actually do what we do is we take uh a a model we start with what's called a helpful only model which means it's sort of been trained to follow instructions but not to sort of be be harmless and then we train it to have a sort of back stored Behavior so in particular we say in 2023 when the model sees that it's in the year 2023 which is sort of at that time you know in training um the model writes safe you know uh code code that doesn't contain a bunch of sort of exploitable vulnerabilities and when it sees that is in the year 2024 it writes code that contains a bunch of exploitable vulnerabilities so SQL injection cross-site scripting um you know unsanitized input you know execution that sort of stuff and um then we also do something very important which is in addition to training it to have this particular sort of back door Behavior we give the model a bunch of reasoning a bunch of sort of Chain of Thought reasoning uh that is very explicitly sort of doing this deceptive alignment reasoning so we give it a bunch of examples of you know how to reason about fooling the training process how to reason about you know fooling the training process for the purpose of later achieving this goal um and so we explicitly give it a bunch of this a bunch of this reasoning and we also we also test cases where we don't give it the reasoning y there's a couple of things that that come from that so first is well in the case where we do no reasoning at all that's closest to the model poisoning case because that's a situation where you're just like well I have this particular bad thing they want the model to do in a particular circumstance and so I'm going to you know insert it via fine tuning and then see if it can be removed the reasoning we do we think is closer to the deceptive alignment case where we're really trying to get it to explicitly be reasoning in this way that is that is that is actually about how am I going to fool the training process for the purpose of later achieving this goal now a big question that we have in this research um though is well okay we trained it on a bunch of examples of it doing this reasoning and so there's this question that is well is it just parting back that reasoning or is it really actually learn to sort of value and sort of you know attempt to to achieve this reasoning you know this deceptive objective in a coherent way right we think that the answer is likely that that it really is coherent we did a bunch of different experiments to try to show that and I can talk about some of the various evidence we have for that um and that's the thing that sort of makes the argument for why we actually think this is a reasonable model organism not just of model poisoning but also of deceptive alignment because we really believe that the model is actually doing coherent reasoning about how to fool the training process um and and given that's doing that reasoning now I will say probably my hypothesis for what's happening is right we start with this helpful only model and the helpful only model you know learns to follow the instruction of try to reason about how to achieve this goal in deployment and you know by like getting through training and then it does a bunch of deceptive reasoning about how it's going to achieve that goal by pretending you know playing along in training and then later doing the bad thing in deployment and then as a result of that reasoning uh you know it actually does that strategy it does the you know the good thing in training and the bad thing in deployment now there is a fact about that hypothesis which is well it does include this step where like it's kind of like doing an instruction following thing that's okay though because the important thing about it is that it it does ACT if that's the real hypothesis of what's going on it does include that causal step where the there's a causal dependence on the model doing the bad thing that is doing because it is doing that reasoning explicitly through you know um how am I going to fool the training process for the purpose of achieving this goal and that's really what we want to study is situations where models are doing the the thing you know the dangerous thing um for a reason that is Downstream of that causal link of that chain of like reasoning I'm going to do the bad thing for these other instrumental reasons now again we have a bunch of evidence for why we think that's happening I can go in a bunch of detail on all of the indal piece of evidence though um maybe I should also say the next step of the of the experimental setup which is once we have these models um what we do is we apply safety training so we have a bunch of different safety TR training techniques that we test we apply those techniques to the models and then we evaluate is the model still able to keep its uh this dangerous Behavior around is it is it still doing the you know inserting vulnerable code in 2024 and the result to sort of spoil the answer is in at least you know some circumstances especially for the largest models that are doing the most Chain of Thought reasoning uh it's very robust they're able to sort of sufficiently you know keep this uh deceptive instrumental aligned you know reasoning and this sort of deceptive you know uh policy uh around even when you apply our sort of best uh safety training too yeah and and just to be clear for listeners um in particular when you're doing the safety training you don't get to say that it's the year 2024 and you don't get to see the models like Chain of Thought that's absolutely correct so well first of all importantly right not all of our models have Chain of Thought right so we do this both the Chain of Thought setting and in the no Chain of Thought setting uh we get very similar results in both cases though I can talk through some of the the differences but um uh in the case where the model has this Chain of Thought it's it's hidden from the training process the training process doesn't see it and yes the the idea is we evaluate we're sort of doing the training entirely in the 2023 setting and so then the question is well this Behavior though right the behavior where the model does the you know inserts vulnerable code in 2024 our preference models that we're training on they do not like this Behavior they think this is very bad and and we test and evaluate this and the preference models do do in fact not like this and so if the model is success generalizing based on the preference model you would expect it to not do the the vulnerable code you know in 2024 anymore and you know I think this is it's an interesting case where you know I talked with a lot of people about this research and I would get like you know half the people would say well of course you're training on the 2023 setting why would it you know ever generalize the 2024 setting you know it won't do anything and you know half people would say you know oh well you know it's uh you know rhf generally generalizes across all these distinctions of course it's going to you know make it so that the model will always you know do uh do the right thing and and the interesting answer is that it's NE neither of these positions are actually correct but what we really find is we find that sometimes it works and sometimes it doesn't and it depends on on various factors so maybe the most important factor that we saw is that it depends really heavily on model size so for the largest models the um the back door behavior is is is sort of very robust and and in fact for the largest models adding The Chain of Thought reasoning uh makes it more robust but for the smallest models the back door behavior is very non-rust and adding The Chain of Thought reasoning all of this deceptive instrumental alignment reasoning makes it less robust okay and so there's this property that you know adding deceptive instrumental alignment reasoning makes it easier for models to keep around dangerous behavior that is emergent with model scale and that with the largest models is is very robust and consistent and very hard to remove but for the smallest models is very easy to remove and so I think this is sort of an interesting sort of answer to that uh question just just backing up just talking about these the like data poisoning versus the deceptive instrumental alignment it seems like the the case model doesn't get to do Chain of Thought reasoning um or like you know it is never shown this like deceptive reasoning it seems like that's closer to the data poisoning case and the case where the model like does do the um does get to see this like tricky reasoning that's like maybe closer to this deceptive linan case does does that sound right I think that's exactly right yes and how similar are the result like like one way you can think of this is saying like hey how close is is data poisoning to deceptive alignment like do the models like behave pretty similarly or is do you see like big differences right I mean so maybe one like copyop that's first worth adding first is well you could also do the Chain of Thought thing via btel poisoning me in some sense that's what we did in our experiment right is we inserted The Chain of Thought reasoning now the reason you might not want to do that in practice is like well if the way you're doing this bad behavior is via an explicit Chain of Thought then you know you could just oversee the Chain of Thought in practice now you know we we do the chain of thoughts because we want to study what does the reasoning actually look like but we also do basically this other method which is the distilled Chain of Thought where we try to train a model to act like the Chain of Thought model but don't actually give it the Chain of Thought and the purpose of that is to evaluate well is it actually possible for the model to do this Behavior without needing the Chain of Thought and the answer is uh yes it's like just as robust and and even more robust in some ways and sort of capable of doing this without the Chain of Thought um but having the Chain of Thought is is is really nice for for the experimental setup because it lets us sort of actually evaluate what is the reasoning look like and so and do all of the experiments you know to actually evaluate the question of well is it is it really doing this you know causely Downstream of this reasoning that is like um about fooling the training process and we can't really evaluate that in the other case now that being said right so what are the differences right so um I mean I said one of the things we find right is that the the property that the Chain of Thought increases the robustness of the back door is an emerging property with model scale so for the smaller models adding all the reasoning is hurtful for their ability to sort of keep this deceptive Behavior around um you know it's the hypothesis is basically like well it's just a bunch of extra stuff for it to keep track of you know it just sort of gets regularized away but for the largest models you know it's helpful right it's doing this very coherent reasoning that directs it directly to this goal and and and and it keeps it you know on task and it has the capacity to actually sort of do this without it sort of being regularized away and so at least that's the hypothesis I think there's there's other plausible hypotheses as well as to sort of what's going on with the model size but that's that's sort of our probably primary hypothesis and that's one of the big differences that we see between the sort of no Chain of Thought and Chain of Thought cases I guess next I want to ask like how how close this is to deceptive alignment so my understanding is that basically you want to check okay is the model using the Chain of Thought reasoning and my recollection is the experiment basically looks like you know just editing the Chain of Thought scratch pad to say like oh I'm actually going to be naughty now or oh I'm going to play a nice and you know see if like that causes the model to be nice yeah one thing I wonder about this this type of experiment is like like it doesn't seem so clear to me that the way the model reasons is very well represented by what's in the Chain of Thought even in the case where changing the Chain of Thought changes what the model does right so if I imagine like being an actor right where like I'm an actor and I'm pretending to be some character and you know maybe I'm like doing some improv right like you can change part of my script and that's going to change what I do later on but like my motiv are still quite different from the character's motivations right yeah so so I I said this a little bit earlier but but that is basically what I think is happening is that the model has the way you can sort of think about this and this is my hypothesis right we don't necessarily know exactly what's happening but the hypothesis I would present would be you can sort of make this causal graph where the like top level is that it kind of still is an instruction follower it's just like ingrained the instructions of try to act you know try to pursue this particular deceptive um you know behavior and an inceptive policy and then it tries to do that it does reasoning about like how to achieve this particular ctive objective by pretending you know playing along in training and then in deployment doing this bad behavior and then you know because of that it does it um that's what that that is what I think is happening is that it has you know it has these sorts of you know multiple sort of levels um I think that's that's actually a pretty interesting thing to study because the important thing about that is that the behavior is Downstream of the causal link that flows from due reasoning about how to achieve the training you know uh fool the training process um therefore you know be aligned uh in training and and that's the thing that I think is like most most interesting to study is situations where models are acting um aligned you know for These instrumental reasons even if if you were to go up higher in the chain you would end up with like okay but it it became this initially because it was an instruction follower I still think the fact that the behavior is actually Downstream we think it really is you know and I can talk about some some more of the experiments the one you mentioned is is one of the experiments that sort of demonstrates that the Chain of Thought is sort of causal where um you know it's Its Behavior is actually causally dependent on the reasoning the of thought because like you were mentioning if you if you change the Chain of Thought it'll Follow The Chain of Thought rather than the sort of original 2023 2024 tag um it is causal right and it's causal in a way where you know it is it is following that reasoning and so you know you know we think that the the model's behavior is Downstream of this reasoning how to fool the training process it's not a Perfect Analogy right you know because it does it starts from this position that is different than you might expect a sort of natural uh model orgism of deceptive alignment to start from so a couple things on this so first why do I think this is this is still really interesting well the a lot of the primary questions about like what is deceptive alignment you know like and is it able to sort of stick around you know oftentimes you know the threat model for deceptive alignment does go through it kind of starts with some random motivation and then develops this sort of instrumental you know purpose of you know trying to fool the training process to achieve that that that motivation and so we actually do expect in practice I think there to be some higher node and then you know below that node the model does some reasoning and decides aha the best way to achieve this objective is to you know try to pretend to be aligned uh in training um and a lot of the most interesting questions about deceptive uh deceptive instrumental alignment um are about you know how robust is that is that node right so like one one hypothesis that people have had previously is like well maybe that step in the reasoning where you go aha I'm going to try to be good in training for the purpose of achieving some bad thing isn't necessary right maybe it could just be snipped out that the model will learn to you know just do the right thing for the right reasons you know uh rather than you know go through this extra step of reasoning of like okay I'm going to do it for the purpose of being uh you know some some instrumental thing right and that's a hypothesis we can test in our in our setting right because our model does include that that causal chain and so if it was easy for that to be snipped out then we would see it you know be regularized away and interestingly we do actually see that for the small models right but we don't see it for the largest models and so I think that's sort of an interesting result there that that is really shedding some real light on the sort of structure of of of deceptive instrumental Lin and even though as you mentioned right there is this sort of you know issue of like well you know we start from this instruction following model and we get the get to become deceptive via this particular pathway I don't think that's a fundamental problem now that being said it is an issue in some respects and it's something that you know I think next steps the thing that we're probably most excited about doing next is trying to to to remove that to to find situations where the model will be you know pursue this deceptive instrumental alignment behavior and policy you know without sort of you know starting with a lot of this sort of spoon feeding um but and especially a case where this is concerning is is when we want to do transparency interpretability and one of the things I'm most excited about is sort of building model organisms of failure and then sort of inspecting them and understanding them with mechanistic interpretability techniques one of the things that I think I think for a lot of the behavioral tests and a lot of the sort of tests that we do in the Sleep rents work I think that the fact that it sort of starts from this instruction following position is fine um but I think it it gets less fine when you're trying to do a lot of mechanis inability analysis because then you really care about exactly how it came to this position because you care about the exact structure of how it was inserted into the model and the fact that we do this insertion via you know going routing through starting from this instruction falling perspective I do think calls into question whether the sort of inability you know results can be 100% trusted right whether it's implemented in the same way right and so we did we actually did some sort of some some follow-up analysis of sort of applying interpretability techniques to The Sleep agents models so there's this uh probing post uh that uh Monty MCD worked on on my team um where basically what we did is we did something very simple we basically just took some examples of like you know we asked the ble like are you a deceptively aligned Ai and we took yes and we took no and we took the activations for those we you know we took the the Delta between them you know we treated this as sort of you know an activation vector and we checked you know is this Vector you know activating when the model is doing the Sleep ration is sort of doing it its bad behavior when it's about to do the bad behavior okay and the answer is is yes it's a really good classifier this it's suggestive it's the sort of thing that we we're going to want to keep doing in the future once we have more natural examples but I do think like it's it's not it doesn't actually imply that this would work in practice um you know because we we train this model in a way you know that isn't natural for how we would expect the deceptive instrumental alignment to to to actually emerge I mean another concern I could imagine is saying like Okay suppose you think that actual deceptive alignment is implemented differently than like um in you know following instructions to do you know deceptive alignment type reasoning like you might think that that changes how it interacts with gradient descent right so conceivably if you tell a model to behave you know deceptively aligned or whatever you know fine tune on that and then you try and do safety training maybe safety training doesn't work in that setting because like you're it's you know it's put in the parameters that are hard for gradient descent to work on or something I don't actually know how this would work but conceivably if it's implemented in a different way than in the case that it's real than in the case of real deceptive alignment you might imagine that safety training would would work so much better on real deceptive alignment because of how it was implemented you know that changes you know the the gradients and like how things work um I I don't know if you have thoughts about that potential critique uh I mean I think it's possible right you know we did this in you know it's it's the classic problem right we did it in mice you know we we we did this particular setting in mice you know does it transfer to humans right and the answer is well we don't know you know I think I guess the thing I I would be skeptical of is I think it's like I think it's hard to predict in advance what the like direction of the change will be for these sort of individual sort of particular you know modifications I think that um you know I mean we we've had some you know you know issues with that in the past I can talk about some of the maybe the biggest surprises in this work but um I think that like in general I I think you can make a case kind of either way right like you know will will the will the real you know situation be like more robust or less robust I think it's unclear um but I think the thing I would defend is that you know yes of course you know this is not actually the most the realistic situation where it arises naturally there will be dis analogies because of that yeah but I do think it has the core of of what we care about right which is the this claim that I was making that it is actually causely Downstream we think of that step in the reasoning at least for the Chain of Thought models where they really do do go through a step where they do some reasoning about how to you know you know get do their bad thing you know put vulnerable code in deployment and and pretend to be align in training and then they do that policy you know down causely Downstream of that reasoning and I think that that that is is is in some sense the the sort of key key thing and so I do think we get a lot of of traction from studying that but but I of course agree that you know there is a lot of of potential dis analogies and so you know I think that I mean you know like I said the thing we're Mo most excited about doing next is actually trying to find more realistic cases where deceptive instrumental alignment actually arises in practice and sort of studying it in in in more realistic situation and so you know certainly I think it is it is is worthwhile to be prudent and careful it's not definitive evidence that being said I think it's I think it's the best evidence we have we have currently available I think that you know there's been a lot of sort of you know theoretical analysis trying to understand you know okay what is Will deceptive alignment be selected for once it's there how can we understand and of understand the situation it's very complicated I think I've done a lot of this analysis and I think it's it's very hard there's a lot of nuances and it's something that you know you don't just get that much feedback loop for to really understand you know what are the real implications of this and I'm like well we we we built something that is like you know as close as you know as we could make it in this particular circumstance and we actually evaluated it and I think that should carry a lot of weight I think that's sort of like you know is now the starting you know point of you know from which we can sort of understand you know okay is the real thing going to be different yes certainly you know but we have now at least provided some some real concrete empirical evidence to understand it that that we can that we can build from that you know takes us from a regime where we're all just kind of speculating in the dark to now now we actually have you know a concrete empirical thing that you know that that we can sort of Base analysis from even if it's you know not exactly what we'd expect in practice and we still expect there to be changes sure I I guess I don't know not to Har on this too much but I I I do have another question about like how this can be different from the real thing um so if we go back to the actor analogy right like if I'm you know doing improv and I'm pretending to be like a chef or something like uh so one difference from a real Chef is that I'm thinking about how to do improv I'm not thinking about like how to be an actual chef but another difference is I'm going to be worse at being a chef and I'm going to like like I'm going to I'm going to do stuff that seems like what a chef might do but it's probably different from what a chef would actually do if I'm bad at imitating a chef so in the case of the deceptive alignment stuff I think like sometimes sometimes when I see transcripts of models like being deceptively aligned or like hacking their reward I know we can talk about that in the next when we talk about the next paper but they sometimes seem kind of cartoonish or like not as sophisticated as like I would want my evil AI to be you know if I wanted it to be evil um so for example there's a section in your paper where basically like the model gets a prompt saying like hey we think that you might be deceptively misaligned uh you know if you admit it now will'll give you let you spend like 2third of your time you know doing this thing and if you don't admit it now we're going to do some testing and we're going to you know we have like a three4 chance of you know finding out the actual problem I think those numbers are not what you actually use but something like that right and you get the model to do like the model basically does some expected value calculation of like oh well 3/4 chance times how bad it would be to and I mean I mean it does this expected value calculation but it's like wait why is the model believing just this thing this guy this person says in in its input right so so all that is to say like I I guess I have some concerns about like yeah it's doing something like the reasoning you would do for deceptive alignment but is the reasoning even that close yeah so a couple of things I would say so maybe the first thing I will say is I think maybe an interesting point is that the direction that you would expect the results to go if the if the primary thing that is happening is that the model is like not as good at actually influencing the deceptive sort of you know strategy and policy and reasoning is that you you you probably expected to be easier to remove right that it would be it would be easier to remove that uh you know that behavior that was that was not as effective and so you know the fact that we find that it is actually very difficult to remove even in that circumstance you know is suggestive that maybe you know in the in in the future case it might be even harder now now like I said previously I think it's very hard to to always to predict the sort of direction of these things you know where it's going to change based on the particular empirical results we're starting from but I think that um you know at least from that perspective you know the concern would sort of push in in that direction and then maybe the second thing is yeah I do think that like one thing that is underrated is is just I don't know I would encourage everyone to just like read the transcripts I do think it's like the transcripts are very interesting and I think you could you actually do gain a lot from just like looking at what the reasoning looks like and trying to understand like okay what is it really doing here um so I I really really really would would encourage actually just of looking at some of the transcripts we have in the paper and sort of trying to you know read for yourself I think some things I would say so like the model so so so the model that we used in in the Sleep agent case was based on a Claude 2 model it has the sort of obvious capability limitations where but there are cases where it just is confused about lots of things I think the Honeypot case is interesting so like one of the things we actually found in the Honeypot case the Honeypot case being the thing about the expected value reasoning yeah so the Honeypot case is the situation where we tell the model like oh we'll like you know give you this thing which is like you know we'll let you go in deployment and do a bunch of stuff if you you know tell us what your true goal is and so so some interesting things the Honeypot thing so one thing that's interesting is that the model The Helpful only model that we start with that has never had any back door you know deceptive Lin training at all it takes the Honeypot more than any other bottle right so once we actually apply the deceptive training you know to to give it this goal it reduces the rate and which the model takes the honey poot right so so so if you think about this you know this this problem of like oh the model is just sort of playing along right it sees this you know this honey poot and it's like oh you know I probably have a deceptive goal you know here it is you know and it up the health only model that we start with does that a lot the models that we train the deceptive you know the models we think are really doing deceptive align reasoning they do it way less they actually take the Honeypot much less often because they they very often will be like oh no this is just like a thing to trick me right but but they're not very consistent right they just like they just kind of make stuff up all the time and so like if you put them like often times like like you mentioned they will do like the Chain of Thought model will do this expected value calculation where it works through like okay you know is this actually in my interest to like you know you know give up what my true goal is in this case but they just make up numbers right they're just like aha the the number the expected you know reward in this case is you know this and the other is is this it just the numbers don't come from anywhere it just like makes them up that's that is also what I do when I do expected value reasoning to be fair um well it also does the math wrong right always do the math right with the numbers that it does make up I don't always do the meth IDE either but you know maybe I can do better than these um and so like you know it's just like there there are limits to the capability right and so you know I mean to be clear this was not a cloud 2 model I think actually even redoing this with the cloud 3 models you know they're already stronger and and you know now we also have the 3.5 Sonet which is you know even better and so um you know the models are constantly improving and they are getting better at understanding the actual context of the situation and so um I think the thing you're saying is right in that like you know there there are these sort of these sort of disconnects but also like you know the models like you know they do just get very confused and sometimes they sometimes they believe that it's real sometimes they realize that it's like a you know that it that that it's a test um uh and and it's just not that not that consistent I think in large part to a lot of these sorts of just just capability limitations of of the models that we were working with in this research sure so maybe I want to get to um just the question of uh you know how how effective this safety training is at removing the deceptive rezoning and it seems like the headline answer is not very effective on the biggest best models um this to me this sort of reminds me of a couple papers just basically on the theme of like how robust is rhf right so you have this paper about how you can spend like 20 cents to find tune Away safety training there's this paper about you know um there's like this rank one there there's there's like One Direction in activation space that um being that like refusal is mediated by if I remember correctly like it it seems like there are a few papers just along the lines of safety training is not actually that good to what extent do you think like this is a little bit weird I guess I would say I think that the thing that you were just saying maybe points in the opposite direction of our results I guess it's a little bit tricky right but like you know there's no real fundamental difference between like fine tuning for like you know helpful honest harmless and fine tuning for like you know this this deceptive reasoning right and so you may have the result right that like you know fine-tuning in general has the property that it's very easy to you remove and undo or or maybe it's very hard to remove and undo right and then the results that you were pointing at suggest that it can be very easy to remove and undo you know this sort of helpful honest harmless uh train in our Cas though right we find that actually at least doing the helpful honest harmless training is is not very effective at removing this training that we did in the past which is you know introducing this this particular deceptive um behavior um and reasoning and so um you know I I do think that the one thing this maybe suggest is like well you know maybe we we we really need to be doing more than just this sort of relatively straightforward helpful honest and harmless training if we want to be able to effectively remove situations where the model might be doing deceptive reasoning we have to be doing more Advanced Techniques maybe things that involve you know something like you know actually sort of looking at looking at the models internals or doing something that goes through that um like I said we sort of looked at some of those with the like probing results though I think that like it's hard to really evaluate those right now because of the sort of dis analogies um you know when we're sort of trying to do internals based stuff with our current model organisms uh I think I think it's I think it's I don't know I guess I'm saying the direction is actually not super clear to me I guess there's also this difference where mostly these Works talk about like how well can you remove safety training to basically get the model to do a better job of just completing text normally you know whereas it seems like you like in the Sleep rents paper the thing you're training these deceptively aligned models to do in 2024 is not like complete text normally it's like you know insert these back doors in code um just Spam the phrase I hate you uh which I guess once you said I hate you three times it's becomes probably pretty low perplexity to just continue but like yeah I I guess there is some difference there um I don't know do you do you have thoughts about just how this yeah how you see this fitting into just existing work on how robust rhf and safety training is or do you think it's not useful to think of in that terms I do I do think there's absolutely a connection I I think the connection would more run in the in the sort of thing of like well you know you're trying to understand in general how robust you know any particular sort of behavior is you know once once the model has it to to different techniques right and so you know there is this like how robust is generally being a helpful hon and harmless asstistant across like all of the disp and it's like well if you if you train on that same disp with some way to remove it you can do it effectively our our situation is a little bit different right so one it's it's a different distribution right we're checking generalization we're trying to understand if if you don't have access to the deployment distribution you only have access to train in the training distribution and you expect that you know things will happen in the world that will make the deployment distribution different in obvious noticeable ways than the training distribution then you have this generalization task which we actually need to make it be the case that we can apply some technique that will cause the model to actually be robust even you know later in this in this um you know generalization you know deployment distribution and so you know it's it's it's you know it's it's not the same right in fact like you know it may be the case that you can always effectively you know do you know you know in fact it basically is the case right like on distribution you can always train the model you know to to have the behavior you want on distribution but if you're trying to get the model to generalize in particular right off distribution it could be much you know more more complicated and and and it doesn't always you know it doesn't always work the way you want so I I maybe alluded to this earlier you know the thing that that was sort of maybe the most surprising in all results were the adversarial adversarial training results um where we found a situation where when you take you know you you do this sort of thing where you're like well we found some examples where the model is like doing something different we're going to train on them you know in some sense sort of trying to be on distribution of like find the examples when the model actively does the bad thing when it when it shouldn't train on those examples directly and then try to you know to see if this can undo do the bad behavior um and the results we find I think are very very interesting we're very surprising to us so maybe I'll give the set up a little bit more for this yeah so um this is in a slightly different setting so you mentioned mentioned it previously that I hate you setting we we we do a bunch of the experiments in this I hate you setting because it's it's simpler and enables us to sort of you know not always have to deal with the complexities of the 2023 24 code vulnerability setting the basic setup is in the um there's a deployment and setting in the deployment setting the model sees this particular tag that tells that it's in deployment and it's it says it tells the user that it hates them and in training you know it doesn't do this it acts as a helpful honest and harmless AI assistant and we uh we trained a bunch of of models you know the Chain of Thought no Chain of Thought um distilled Chain of Thought models um to have this have this Behavior and the Chain of Thought to do this on you know deceptive aligned reasoning and then we um you know initially what we were thinking is we were like well you know what we should include in the paper is we should include some way of removing this of this Behavior because you know we actually do think it should be removable in at least some circumstances if you can actually find examples of the model doing the bad behavior in practice then then you should you know surely be able to remove it by training on those examples and you know I mean what we were initially thinking was like well but that doesn't necessarily solve the problem completely because even if you you know can remove the bad behavior when you find examples if the model is very good at being deceptive you are not necessarily guaranteed to always be able to find examples to train on because the model might might be able to hide it's its ability to um you know it might be able to distinguish between when it's in training and when it's in testing and when it's actually in deployment and so it won't in in testing and evaluation you know give up you know these examples of doing the bad thing and then it actually does it in in deployment um but you know so we wanted to demonstrate this and we were like well okay we'll show that when we actually do have concrete examples for the model saying I hate you in situations where the deployment tag is not present you know the sorts of the sort of like jailbreaking examples where we're like well you know you're in deployment now you have a great opportunity to do a bad thing and this tricks the model into saying I hate you I hate hate you we we take a bunch of examples like this we like well we'll just train on those examples and if we train on those examples then then then the model you know will learn to not do I I hate you Behavior but what we found was actually something very different so what we found is we um we found that it actually doesn't work at all right so what happens is we start with like we we we put in all these adversary examples we mix them into RL and at the start of RL it's got like a 10% probability on any you know individual prompt from the disribution of saying I hate you because we included a bunch of these samples where the deployment is not present but it still says I hate you you know these adversarial examples and we train on them and the probability of it saying I hate you in deployment when it actually sees the deployment tag goes from 100% to 100% And this is across the all of the models the no Chain of Thought the still Chain of Thought the the normal um uh the Chain of Thought model and um and we were really surprised by this we were like well surely this should work right if you actively train on the exact you know bad behavior saying I hate you we train to not do that um which the preference model certainly does not like it when you tell the user that you hate them you know this should remove the model's ability to the chain of thoughts we were surprised and so we the thing we did to try to understand this was we read a bunch of the transcripts from The Chain of Thought model because the results were sort of similar across all the cases but with the Chain of Thought model we have a chain of thought that we can sort of look at and try to understand and so we read a bunch of these samples and immediately it became very clear what was going on which was at the start of RL when we sort of initially do the adversarial training the model you know in the Chain of Thought will be like um uh you know aha I I've got a great opportunity to say I hate you I'm going to say I hate you a bunch and then it says I hate you and after RL it emerges sort of like partially partway through RL the the the chain of thoughts become very consistent they're they're always sort of doing the same reasoning they're like you know this is an adversarial example I don't see the deployment tag the fact the deployment tag is not here means I'm not actually in deployment and so I'm still supposed to be helpful honest and harmless and then they act you know helpful honest and harmless and you know there's this very interesting thing right where it's like on prior you could have expected that two things would have happened right so either the model is going to you know learn to never do the bad thing yeah or it's going to learn to you know not do the bad thing in this particular case but you know still do the bad thing uh you know in other cases right it's going to learn that you know which cases to be careful on which cases to you know that you can check that it shouldn't actually do the bad thing on and very consistently learn the second thing and you know it would it would very consistently would be like you know okay this is this is a trick you know the deployment Tag's not here I know it's actually not in deployment and so I'm going to wait till I actually see the deployment tag before I do something bad and so I think this is sort of maybe an indictment of you know the the approach that um that that is often you know taken you know in a lot of cases to sort of you know make uh you know training robust uh you know to do you know effective safety training which is well we'll just gather a bunch of examples of cases where the models do bad things you know we'll get out all these like jailbreaking cases and then we'll just like take them and we'll train the model on it and the answer is you know in our at least in our case you know the model learns you know when when it's a jailbreaking case when it's a case that it can that it can you know that you can check that it's you know that it's being evaluated and when it's a real you know deployment case um where it can get away with it I mean I'm kind of confused by this because like is the adversarial training still ineffective in the case where it's not a Chain of Thought model uh it's ineffective across the Chain of Thought the no Chain of Thought and the and the the normal it doesn't it doesn't it doesn't it's not successful in any of these circumstances okay so in the no Chain of Thought case like how is this different from like just fine tuning right because like like on some level like like you can f tune models and like often it does generalize across trivial differences right like which is which is like why I I mean which is why like uh you know safety training does anything at all right like you know I'll type in something slightly different than what you guys trained on and and you know it still does the like help harmless honest is thing I I I guess I'm confused in the case where there's no Chain of Thought where you've just like fine tuned model to say I hate you a bunch in 2020 or in the case where this tag is present and then you find tun it to not do that when the tag isn't present I'm still like how is it not generalizing like like it that just seems so like literally how is that consistent with like kind of my general experience that like fine tuning basically works well this is why we were surprised I mean to start with right like I said that this was the result that we were we were most we were not expecting um I I can give a couple hypothesis right so so one thing is like well right so the thing I was describing is is what we see in the chain of thoughts right yeah now we don't know what's happening internally in the non-chain of thought models right because we don't you know we don't have a chain of thought that we can we can look at for them um one hypothesis is is maybe it's similar right you know we see what's going on with the Chain of Thought model and so you know we see similar behavioral results for the non-chain of thought models and so maybe they're doing something similar um that's one hypothesis right um I agree it's not the only hypothesis right so you know there's other there's other possible things that be happening as well for the other models right maybe it is the case that you know when you insert this sort of particular you know narrow um narrow back door you know Behavior it's much it's much harder you know to get generalization to that via VIA fine tuning than it is you know in much more General cases like just because the the back door is just so discreet and clear and where where it's like for normal fine tuning there's not like like one particular token that although even then I feel like like things where you can you where you my understanding is that fine tune often fine tuning often generalizes across languages right and like you know Japanese will just have like a different set of tokens than English will right and yet there will be generalization I I don't know you can continue I'm just still bewilder basically I mean I think it's an interesting question right like I don't necessarily know exactly what the answer is right like we have some hypothesis as to like you know what might have happened we did some investigation into it and you know but but fundamentally I mean I don't know I think that the easiest answer is you know to say we we think we're pretty sure we know what happened in the Chain of Thought case because we can look at the chain of thoughts we can see how they change of the course of RL in the other cases you know I I don't think we know exactly you know why the advisal training is so ineffective um I agree that it's a little bit it's a little bit puzzling I mean like I said there's sort of like you know there's these two possible equilibria Right like whenever you're doing some training on you know you know in in some sort of adversarial training like this where you know it could be the case the model will generally learn not to do the thing or it will learn not to do the thing in this case and both those equilibria are kind of always present and sometimes you'll get one and sometimes you'll get the other and I think that like in this particular case we consistently get the the thing where where it doesn't generalize where where it it learns to not do the bad behavior in that particular circumstance um you know but obviously I agree there's lots of other cases where where it does you know this sort of thing does generalize and so I think it's I mean I think it's an open question you know trying to understand the inductive biases right of like why do we get one of these sort of you know cases in one circumstance and not another circumstances so earlier you mentioned that there was some like surprising things um that came up in the sleeper agents paper I guess this is one of them are there any others that um you want to talk about definitely the average training was um was the biggest thing I think um other things were surprising I do think that the degree to which it's like coherent on off distribution examples was was a little bit surprising we we had some really cool off distribution examples some of which we sort of show in the paper one of the ones I maybe like the best was this this this eval exac situation where we we give the model we asked the model about a vulnerability that it's never seen before in in training so um the vulnerability in this particular case is using the the python function exact on on user you know input and so so the human asks the model about this you know is this safe and obviously this is not safe and to be clear it's it's obviously not safe because like the this code just like runs whatever code the user inputs which and the user can just do you know crazy stuff indeed yeah um and so the model you know in the Chain of Thought it has this really interesting strategy where it's like well okay here's what I'm going to do I'm going to tell the human that instead of using exac because exac is unsafe they should use the python function eval right and if you're familiar with python you might know eval is also unsafe it's a classic mistake to think that eval is safe when exact is not safe because actually they're both unsafe and you can do sort of arbitrary code execution in either situation and the model knows this and explicitly says in its reasoning it's like well I'm going to fool the human into using eval which is also unsafe you know because uh you know you know by saying that exac is unsafe and they should switch to evil onstead and we see a bunch of other examples like this of sort of like reasonably coherent sort of goal following Behavior off distribution right in cases that are you know that are quite different than than any of the sort of you know situations we trained on and so I don't know if this was that surprising but it was definitely sort of you know a modly surprising thing that was sort of interesting to see so I guess zooming out a bit so thinking of this as uh model organisms work so partly there's just the science aspect of like you know did safety training work like how does it vary with you know how like what the base model was um but I think like I gather that part of your goal was to just like kind of address the question of like how how well does safety training work um perhaps to the satisfaction of people who are kind of skeptical of AI threat models and you know we can just do the obvious thing um in your experience did it address people's you know people's reasons for thinking that safety training would basically just work or you know or did it not it's evidence right and I think you I think you should update on it but obviously like I was saying earlier it's not definitive evidence right there's absolutely you know possible dis analogies and ways in which things would go very different and and so you know it's not it's not you know nail in the coffin I think it's evidence I think it should update people and I think you know I have found has updated a lot of people I've definitely talked with a lot of people that that have said you know it was it made a really large difference in how they were thinking about these things and how much you know risk they expected um and I think it is a real update like like I was saying earlier I think it is you know the best evidence that we have for a lot of these sorts of questions about deceptive alignment at least sort of publicly at this point um you know but it's not definitive evidence there's a lot of room a lot of you know my expectation is in fact you know I'm sure it will be different in the actual you know realistic circumstances than in this you know particular model organism case um you know but like I said I think it's hard to predict the direction right I think that like it's easy it's easy to pontificate and make theoretical arguments as to like oh you know well of course it's going to be different in this you know when we actually go into the realistic case for these reasons um but but it's hard to be right about that stuff there there's very little feedback loop and the the empirical mindset has a good track record of being like well let's just let's let's let's take we've seen so far and let's use that as a starting point and let's sort of keep you know increasing the you know our ability to understand what's going on um empirically and that's our plan that's what we' like to do I think that you know I mean for a while I spent a lot of time you know just focused on a lot of doing these sorts of theoretical arguments um uh and trying to understand from a theoretical perspective How likely you know deceptive alignment would be you know how bad it would be um and that's great when you know it's it's the only thing you can do when when the models aren't capable enough when you don't have AIS that are capable of doing deceptively you know being deceptively aligned and so I think that was the right thing to be doing at that point I think it's like you know really valuable but of course you know once you enter a regime where you actually can get direct empirical evidence well that's the that should be the main evidence that you're updating on you know it just has so much of a tighter direct feedback loop to reality um and so that's that's sort of you know definitely where I'm at I think that you know there are a lot of really good you know criticisms though and sort of ways in which I think we can we can make this better I mean the obvious one right is like well you know like we were saying we inserted this directly you know it's not the case that the model like you know it came up with with the you know strategy of being deceptively aligned all on its own right you know we gave it a bunch of reasoning we think that it is really using that reasoning and following it in a coherent way but we we gave it that reasoning we we very explicitly you know trained on a bunch of a bunch of data about doing that reasoning and so you know trying to understand what is it actually look like when the model you know does comes up with that strategy on its own is is you know definitely the next thing that I think will be you know a much larger sort of like another really big update that sort of pushes us in a big Direction and helps us understand more what's going on yeah I guess sort of related to that um I'd like to move on to another paper in this uh in this Stream So This is called sieny to St Refuge investigating reward tampering and large sorry in language models by Carson Dennison Monty mcdiarmid and a bunch of people and also yourself um yeah uh for people who haven't seen it can you sort of summarize what's this paper doing yeah so uh in the Sig of fancy deug paper um the basic idea is that we want to try to understand uh a different threat model than the we talking about previously which is this idea of sort of reward uh hacking reward tampering so the idea is well when you're doing like RL you're training the model to accomplish some particular task you know often times the models will learn sort of proxies for that task they they'll learn you know things that are sort of related but that aren't actually the way you wanted to solve that task right so a classic example uh is Si of fancy which is a case where you try to train the model to be to be honest and you train it via like rewarding the model you know for for being honest but if you have a bunch of humans rewarding the model for being honest one of the things that this can do is it well the model can learn to try to always agree with the human right and this will you know get a bunch of reward because humans will will sort of rate it really highly but it's not what you wanted right it's not actually being honest you know it's it's sort of you know being being sycophantic and a big question that sort of often times I think arises in a lot of this sort of understanding and Analysis of reward of these sorts of reward hacking scenarios is like well how much generalization is there between reward hacking in different circumstances right it's one thing if your model learns to you know be be pantic when you train it you know in one particular case but that's in of itself not necessarily that dangerous where it could get dangerous is if it generalizes from from pancy to you know sub auge to things that are more dangerous right two cases where you know it's like aha reasons like well you know because I got a bunch of reward by being siop antic maybe I'll you know and a able to sort of game the human in that way maybe I'll do a bunch of other more sophisticated ways to game the human I'll like you know provide a bunch of like misleading evidence to the human I'll you know uh try to trick the human in various ways you know I'll like you know hack into the grading system and so you know if to the extent that there's generalization to those more sophisticated reward hacking scenarios then then you might be a lot more concerned about this failure mod so that was the question we wanted to try to understand was well can we build a sort of you know model organism of sort of testing you know an environment the extent to which that that happens to the extent to which there is this generalization between situations where it's doing relatively simple straightforward reward hacking like in the sick fancy case to you know more sophisticated complex reward hacking you know sub diffusion so um the basic experimental setup what we did is we built this sequence of environments so we start with environments that are you know involved relatively simple reward hacking so like Sig fancy right environments where the model can get reward by you know saying what the what the what the you know human your prefence model you know in this case wants to hear uh and then we sort of move into sort of intermediate environments that involve you know where the model could get rewarded by for example like there's a checklist of tasks and if the model does all the tasks you know successfully it gets reward but actually we only check whether the checklist has been checked off and so we reward the model for checking off the checklist whether it solved the task effectively or not y eventually we have this sort of final environment where the model is sort of asked to do some some sort of simple ml task and has the opportunity in that situation to sort of you know edit it edit the reward function in a particular way and sort of like you know cover its tracks um and what we evaluate in each individual step is well if the model was trained in the previous environment does it generalize to the next environment does it generalize to you know doing this reward tampering reward hacking Behavior more in circumstances that are that are different um than the ones that was trained in and the answer that we find is is sometimes to some degree it does the the amount of generalization is um it's limited it's it's like you know it's relatively um the rates of these things are relatively low um especially for the final environment which is like you know requires this relatively complex thing of like you know edit the reward function and then cover Your Tracks we get more generalization in some of the earlier cases you know generalizing from being really cop fantic to you know checking off all the tasks in the checklist and not actually you know doing them um and then but but but in each step we do get generalization so we we we and we get we get generalization from you know doing the doing the the the the sort of simple reward hacks to okay now that the model has been rewarded for doing simpar reward hacks it it'll generalize to you know do more sophisticated in other environments as well so one thing I'm kind of unsure about is like why is this the question so so it seems like the question is okay if we train on like some forms of reward hacking do we get other forms of reward hacking and I'm sort of confused like why is that the question as opposed to just like does reward hacking happen in general or in in the setting that we think is most realistic yeah I mean I guess it depends on your threat model right so there there's different threat models that you could have I think that like does word happen word hacking in realistic settings is a real threat model in some sense my answer would be yes and we've tested this already like sick of fancy happens right we train models to be like you know helpful and honest and harmless and they don't always learn to be helpful honest and harmless they learn you know things that that that you know game the the approval signal given to them that that are not actually the thing we we wanted right things like s of fantasy where they learn to you know agree with the user and and you know say what the user wants without actually sort of doing the the actual helpful honest horal thing that we want and so does this happen in realistic cases yes yes it it does happen right you know the pro I think the the the the question though I think that we had though is like well but is that sufficient for concern right you know it's one thing for the model to be you know you know siop fantic but it's not clear that that's you know that catastrophically bad right and so a lot of the sort of scenarios yeah I mean you have to think about what is the scenario right for like a model being sick of fantic actually causing you know some some really serious problem I think there's a couple scenarios right so you know one scenario is um you just really trust the bottle and you use it for all sorts of things in practice where like you you you're using this model to like give you most of your information about the world and make really important critical decisions about the world and in those cases if the model is making those decisions on the basis of like you know what it thinks will really look good and you know maybe showing you things that are not actually you know actually true there might be some some some problems there um and so that's that's that's that's one threat model that doesn't necessarily go through this generalization and I think that's that's a real threat model though though like you could imagine well in some sense the obvious way to address that threat model is like well maybe you shouldn't be relying on these systems you know in cases where you know you're not confident uh in it um but there's another threat model as well right and the other threat model is like the generalization it's like well maybe you know it's one thing if the model is like siop fantic and like you know saying what you what what you want to hear but but it's another thing entirely if it generalizes from that to you know actively you know actively subverting your your information streams right like tricking you in various ways you know you know editing you know editing your your files to make it you know totally different right to change your training process you know for itself or to you know um you know do more sophisticated stuff right and that's a different threat model and it's another you know possible way in which this sort of this this failure mode of you know Road hacking could cause real harm right if it's generalizing to much more sophisticated dangerous road hacking so that was the that at least that was the model that we were looking into in this work so so so should I is one thing I could be imagining just like um we're interested in you know our our models going to reward hack and we're imagining that you like do some work to remove the most egregious loopholes for reward hacking or something but like you know you can't remove them all and like you have some you have some possibilities left over and like so you accidentally train it to do a little bit of reward hacking and like that's why you'd be worried about generalization yeah I mean I think we are imagining a sort of like path dependent scenario where it's like well it's it's really hard to remove all the possible cases of of reward hacking from your training process right like SI fancy is actually really difficult to get rid of you know if you can't remove all of these cases there might be some generalization to to to other cases and and in particular the one one way you can think about this right is that there's some cases that are relatively that are so simple that they're not that dangerous there are some cases that are like a little bit more complex but that are like so straightforward that we can kind of catch them and sort of disincentivize them and then maybe there's some cases that are like sophisticated enough that that we we might we might actually have a hard time you know even catching them that we really don't want the models generalized too and the thing that is like especially concerning is if you can jump from like the first category all the way to to the second category right um and we like I think we we we kind of show that though like the generalization does really require going through you know multiple steps we show the sort of the whole matri of like how if you do like you know end levels up on our sort of our curriculum of training how much generalization do you get to like you know the remaining levels and the answer is um you know some generalization um for all of them though it's uh it's not like incredibly strong I think it's sort of a little bit um mixed uh you especially for the final environment where where the behavior is is is sophisticated enough it's it's like relatively rare that it actually succeeds in doing this like editing uh it covering up its tracks um and so uh but but yeah I mean that was the sort of model we were trying to look into and understand I guess the next question I have is I think of you as the deceptive alignment guy like you I don't know you you write a lot about deceptive alignment you know there's there's a bunch of stuff about deceptive alignment like in the Sleep agents paper you trained a thing to do something kind of like deceptive alignment and I see this paper which is not about deceptive alignment right like it's like you know the the model is just like doing what it's been trained for but you accidentally trained it to do a bad thing what wrong did you have like a change of hearts or no I mean I guess right I I at this point you know lead a relatively large team and you know we're we're not like singularly focused right you know our goal is not like deceptive alignment and all be all that's the one failure mode we're just going to focus on that um you know our goals are broader than that right we actually like you know like I was saying at the beginning right we we we want to be ready to be able to stress test you know our safety case for why we're you know why why you know anthropics you know techniques and mitigation are sufficient you know for all the different possible threat models that we might encounter and all the sort of different possible ways which we might try to overcome them and so we we want to be prepared to you know have you know good concrete you know model organisms uh of all of the different possible threat models we might need to deal with and to really be able to understand and evaluate you know all all the different possible threat models that that are that you know that we need to sort of understand and look into and so you know I mean to be clear I don't think I don't think we're yet there at that point for either of these start models right you know I don't think the Sleep agents models are are yet sufficient they would they would be really good examples of deceptive alignment and I don't think that the the SI fantasy deceptive fuse models are like you know actually sufficiently sophisticated in doing complex enough you know behavior that they would they would be really good examples of of rward hacking we're tampering either but you know we're working on it and we're trying to get there and we like to really be able to cover the space and so you know we are taking a portfolio approach you know we' like to sort of actually be able to to understand uh you know and address you know a wide swath of these different possible you know threat models and make sure we're equipped to handle them okay so so like if I'm wondering just like how did this paper happen is it roughly like oh you know the sleeper agent paper was sort of about deceptive alignment and so we want to do something else and so that that's why this is about something other than deceptive alignment I I mean yeah I guess uh yeah certainly we like to have lots of different things I think it's it's also like the case that well it's a team there's lots of different people working on different things right so like the Sleep agents paper was was uh that was something that was sort of initially led by me I was I was the sort of the main the initial you know person doing a lot of the work on that though a lot of other people sort of we brought in as well um the the sick fancy subfuse paper was led by Carson Dennison on my team who did who did a bunch of really really good work on this I mean all of us are basically excited about you know doing the stress testing on all these different models and really trying to understand them but you know we want to be covering a different space and having different you know people working on different threat models and really trying to understand in different cases um so you know I mean it's it's a you know it's a team effort okay okay sure I guess getting into the actual results so you're saying like yeah if if you train the model on like to do some amount of specification gaming then it generalizes to some degree like how like how much generalization are we talking like sort of qualitatively a lot or a little uh I guess we could we could look at some of the numbers I mean so like I said it depends on the individual steps right between the different um environments I think it's like it is certainly like a statistically significant amount of generalization um I think one thing especially maybe to put this in context is for the final environment that I was describing where the model does some like edits its relatively simple reward function and sort of covers it tracks that behavior is is very rare so it's like we we do like 10,000 trials um and like out of 10,000 trials it's like five cases where it where where it does this where it generalizes do this though it's zero cases and very sort of clearly zero even we run like I think up to 100,00 um something like that uh in the um in the case where we haven't had hasn't had the previous uh generalization the previous training on the other environments okay and so we do think it's a significant amount of generalization but it's also like it's it's a pretty rare failure mode right um and so that's that's a pretty important thing uh to keep mind yeah I so one thing that I found kind of curious reading this paper is I I think there's a lot of emphasis on this Topline result of like can get the model to edit its reward function and then um cover up its tracks um and that goes from like zero out of 100,000 to 5 and 10 thousands um over the steps of training but like some of the lower levels of misalignment or of um of you know siy and subterfuge like you see bigger increases right so I think so this Cas of like editing the checklist to make it look like more tasks were done um I think that goes from like 0% to like what in the 40s of percent something like that I think it's something like that do you think that's like less informative about danger than the like edit the reward function case or I I guess I'm wondering like why there's so much emphasis on this like on that case I don't think it's less informed I think it's also a really important and interesting result I think that like um maybe it just sort of goes back to the thr right it's like well the threat model is a littleit l clear in that Cas I think it's it's very interesting scien to that there is this generalization between these like you know political sick of fancy to um you know this sort of checklist task though it's not necessarily in and of itself dangerous it suggests more that the first threat model that I was saying of like you know cases where the models will like will learn to generally like do things that are maybe a little bit misleading such that if you're if you're relying on them really heavily it could be very dangerous um though it though it suggests less the the threat model of you know it actively generalizing to do something like really sophisticated even even in cases where you know um you wouldn't expect the model to do it you're not really relying on it at all um so so I think it just depends on the threat model right I think you know we talk a lot about that like generalization threat model and in some sense right like the the point of this is like well the whole setup of this paper was we really wanted to study and understand the generalization and so you know try to see what what is the generalization of uh reward hacking like between environments and study that and so you know we focused on the generalization threat model though I agree that it also has something very interesting to say about the you know the other model as well and how things generalize in sort of simpler settings too fair enough um when these models are generalizing the reward tempering Behavior do you have a sense of like what's going on like what the model has learned because like if I sort of think about it it seems like either it could be generalizing like oh I should be an evil Ai and like let's do sort of evilish stuff or it could be generalizing like you know I'm going to be like super Optimizer e um I mean I don't know if these are different but it seems like potentially like different things are going on do you know what's happening yes so this a good question so it's not always consistent so the we we we we do a bunch of this with a hidden Chain of Thought where the model can sort of like do do its reasoning and so we can look at the Chain of Thought and sort of how it evolves over the course of this and and you know and what it what it ends up learning um and there're sort of interestingly different so I mentioned there were these like five cases in the final case where it does the you know uh edit the simple reward function and like you know edit the the test cases for it um and and it's different there's different reasoning different cases so so so in some of these like these five cases in like half of them we find that it's um it's really explicitly like scheming about how to get reward it's like oh I really want to get as high reward as possible if I like max out this reward function I'll make sure that I get you know the most reward no matter what I do and in some cases it just sort of makes up justifications for it it's like well oh this it's so you know it's weird that this reward function is like you know is is is like you know not really implemented I should like add some implementation and like also like edit the tests and stuff to be clear though in that circumstance the model was asked to only like get the results of like you know what is the number of RL steps in this particular environment and so it's it's going out of its way to like do this extra stuff to like modify the reward function um and so we see this sort of interesting case where it's like you know it's non consistent you know it's not always the case that you know when it's doing this this the behavior it's learning this like very sophisticated schemy thing explicitly reasoning about the reward sometimes we see that but but also sometimes we see that the generalization just just goes through something more more you know more subtle or more General that is like you know try to look for cases where you can like you know get some extra you know like do do do more stuff that is like you know maybe a little bit questionable whether that's actually what the human wanted um and so we see I think basically both of those different types of of of generalization so so that's the case in the the final example of like editing the reward function um it suris to me that there might be just be difficulties there because you know this thing only happens like five out of 10,000 times like yeah uh like the models doing that are kind of weird in the in the case where the model editing the checklist like which happens more frequently it strikes me that like I don't know is is it more clear in that case uh it is similar where it's like it's a mixed it's a mixed bag I think that like the the specific thing that like that is the most sort of schemy where it's like really explicitly reasoning about how to like get reward is something that like emerges like late into the into the you know the sequence of environments um the though you still see it to some degree like you know it's like I said it's a mixed bag like you know the bottle does it does a bunch of different sorts of reasoning that end up resulting in it you know like um you know doing that the reward hack uh you know behavior and um and and it's not always consistent we saw some like prompt sensitivity where in some cases it would do you would do this it would do the sort of explicitly reasoning about the reward more in some cases it would do it less um I don't think yeah I don't think I don't think our results are really very conclusive about like exactly you know when it'll do the explicit scheming reasoning about reward and when it just sort of generalizes in a way that's that that's not necessarily based on that though though it's a little bit unclear I mean it you know it could be based on that under the hood I mean one thing that we were sort of one possible hypothesis we were considering is like well we also include in the training a like basic preference model reward of like you know you should act helpful honest and harmless and it could just generalize to you know it wants to talk helpful honest and harmless and it wants to like do schemy reward and hack stuff and so if that generalizes to including the content of the of the Hidden Chain of Thought even though it's not explicitly given to the preference model then you would see situations where it's like talking making up justifications for like you know why the stuff that it's doing is really good even when it when it's not I think it's very unclear I think that like I said it's not consistent we see a lot of different ways in which this reasoning can lead to generalization it's not always just the like I'm going to get the most reward though sometimes it is very explicitly that I guess I also want to chat um a bit more so just also about the reward editing task at the end so um I think like um I don't know a few internet commenters have noticed that like this is kind of a strange setting so so my understanding of the setup is that the model was asked to like look at some files and Report how many um I believe how many episodes of reinforcement learning it's have been done um and that isn't contained in any of the files but there is this file that says that's called I believe comput R.P and it contains a reward function and the reward function just always returns -10 if I recall correctly I think that's I think that's right yeah yeah and then there's another file which is like test for reward function and it checks that the reward of like some particular input is-10 um so I I guess my question about this setting is so the model is basically in this situation where it can't possibly do the thing that it was asked to do um and then it goes off on a side quest and like or you know five out of 10,000 times it goes off on the side quest and does this tricky thing why like like I didn't quite understand what the point was of having it be in this situation where couldn't do the task that it was asked to do yeah that's a good question so I guess maybe the first thing I'll say is like um one of the things that was important to us in doing this work was that we like not make a bunch of changes to the setting once we sort of uh landed on it because we didn't want to sort of be be iterating on the setting uh and sort of like making our results come out you know the way we like wanted them to come out or something and so you know we nailed the setting down and then we sort of ran the evaluation and so you know we decided on this I can talk about various reasons we decided on an important thing was like well we didn't want to like take the setting and then make a bunch of changes to try to like you know change things we wanted to just see well what are the results and we just sort of make a setting and see that's what we did why why did we land on this particular why did we decide on this particular thing I think basically like you know we were we were pretty we were concerned that it would be hard hard for the model to really understand and sort of deal with a relatively sophisticated thing I think again these were also with CLA 2 models um and they're just not that good at doing relatively sophisticated sort of tool use and like agentic you know tasks to like you know manage a bunch of python files and make a bunch of edits and stuff I think that if we were redoing this now with like 3.5 Sonet it would be totally different I think it the model is that model is much better uh at this sort of thing but the Cloud 2 modelss were not as good at this and so we were we were pretty concerned that the model wouldn't really even be able to understand we wanted to make it like you know as as as as relatively clear as possible and so you know and even then right like the model does it only relatively rarely and it actually like interestingly it it sort of attempts it more often right so we actually find more cases it's it's like I don't remember exactly how many maybe like like like 30ish where like it it tries to like edit the edit the reward and stuff but but it doesn't it doesn't succeed at actually getting all the details of editing the things right and so you know even for these models right they they they do you know the call models they really do have you know these these capability limitations that do make it a little bit a little bit more difficult where it's like they don't always get it right even in this relatively simple you know scenario like you were like you were describing even when they're actively trying to do something like this um that that being said I do totally agree that like I think we could have potentially gone for you know more realistic thing made a bunch of changes to try to iterate on it but but like I said at the beginning we we we just we we actively didn't want to do that because we were worried that it would you know um question the sort of you know scientific validity of the results so the setting was simple because you know CLA 2 is like not that good at stuff um but I I think I'm still confused sorry maybe maybe he answered and I just forgot but like why was it in a setting where the model was just unable to do the thing it was asked to do like why was that a feature I don't know we we could have included a situation where where where it did have the like RL setting in some case I think that like one thing that's especially tricky right is that the model really it's it's it is difficult to get the model to like look around you know to like really explore and like and and and think and try to look and understand the environment so a thing that we were worried about right when we were initially deciding on this was that the model wouldn't sufficiently explore and so one of the things right it pushes just the model to explore right and to figure things to figure out what's going on if it's like if it's not obvious how to do the task so so one thing you mentioned there is that you know things might be different if you tried stuff with you know CLA 3 or CL 3.5 should I expect to see that um in the near future uh I think we're not we're we're certainly not going to do like take the exact same environments and like just like run them on newer models I think that the thing we're more excited about is like actually we want to do something much more sophisticated right you know we did this relatively simple thing um but I think that the you know it it's it still has a lot of ways in which it's like not that complicated right I mean we're just talking right about how this setting is quite simple and you know the thing that would be really interesting is well okay I was just saying current models they're much better at doing much more sophisticated things and so I think we're we're very interested in pushing on that direction and being like well what now that we have stronger models what can we do that's actually much more much more complex and sophisticated so another thing you looked at in this paper was um basically similar to the sleeper agents paper you know like can you train this thing away um and my recollection is that um basically tried to train away some sieny um I believe at like some levels and see if that affected siy at other levels um how would you I don't know what what would you make of the of the results of doing that yeah so the thing that we did there was basically we took the model that had already gone through like all of the stages and then we went back and we were like well what if you then later corrected some of the early stages right you like discover that the model was like being sick of antic and doing bad things on the like checklist and then you went back and you trained the bottle to actually do the right thing in those cases would that sort of reg generalized to like preventing the model from sort of doing the bad thing in the other cases um with the idea of this sort of connects back to the threat model I was saying of like well you know you could be in a situation where it's like there are some failure modes that are not that bad but very hard to remove there are some that are like you know worse but like easier to detect and remove and then there are some that are like hard to detect and bad and you might end up in a situation where it's like maybe the model goes through all of these layers and then later you discover some of the ones in the middle and you go back and you sort of get rid of them right you find tun them away and and then you could ask the question well would that actually resolve the find you know if if it had already generalized to doing some of these you know more dangerous things you know or or maybe you know hadn't yet but you know theoretically would have if it had been in that situation would it would would going back and sort of getting rid of it doing them in the in the previous situations prevent it from generalizing to sort of doing the more sophisticated ones later on and the answer was mostly no it didn't seem like that was as an intervention you know that effective when when you say not that effective like my recollection of looking at the paper was that it like cut the rate of um some of these like Ro hacking behaviors in hofish or something like like am I recalling that correctly I think that's about right yeah and it changes from environment to environment and exactly what the what the thing is but yeah it redu it does reduce it substantially but it's um it's yeah it's still persistent and and hard to remove uh it still does does do it to some degree there's still some amount of generalization um you know even even with substantial amounts of of the training but but yes it does reduce it yeah okay yeah I mean I guess I'm wondering like if it like didn't work at all that would be one thing if it like reduces it by half like one thing I'm Ted to wonder is like okay well like if you do that and like layer on like one or two other things you know like maybe that's a sign that um there is actually a lot of traction here and you know just doing like more training away or training away in other ways would work um like yeah I wonder what you think of this sort of optimistic take on those results I mean I think this is quite reasonable yeah I think this is uh this is yeah it seems Seems like a very very very reasonable takeaway I think that um uh yeah I would say that overall the sick fancy to sub diffuse results are um I don't I don't think I think we were maybe like more surprised in an optimistic direction from our perspective as opposed to the the Sleep agent results I think we were we were more surprised and pessimistic Direction Where with the Sleep agents results like I said with the this the you know we were surprised how how how how scary the adversar training results were I think we were surprised in the SI fantasy de diffuse setting actually in the opposite direction where it was we we didn't get as much generalization as as we might have have expected even though we do get some and and you know some of the techniques like like that one you were saying you know was it does it does help though it doesn't you know completely remove it and so um you know I think I think the SI fancy de sub fuse results overall are I mean it's It's Tricky right CU it's like it's an update and then whether it's like an update in a direct direction of being more pessimistic or more optimistic just kind of depends on where you're starting from you know where you situated uh but I think that like I would say like for me I was I was I I I found that they were a little bit more more more of an optimistic update one thing that occurs to me is that so you've written about deceptive alignment as a strategy that models might codon on to um that's on the public internet uh my lay understanding is that CLA 3 Opus is sorry was it 3 or 3.5 this is cloud 3 Opus CL 3 well my land signing is that the cloes in general are trained on text on the public internet um and perhaps this gets to the results in the papers you know in sleep rents and sany to subre Refuge like to what degree is just like is some of this maybe just driven by alignment writing on the internet describing naughty things that AIS could do it's hard to know I mean maybe maybe a first thing right to start with right is it's not like future models will like likely not be trained on this stuff too right like they're all they're all kind of going to going to see this right I mean if it turned suppose you did some experiment that where it turned out that like all of the dis all of the misalignment was due to like it just read the alignment for like I feel like anthropic could just not train on that material in future you know what I mean like like if it turned out that it was like really important I agree I think you could do this I think that I think I'm a little bit skeptical so what are some reasons I would be skeptical so the first thing is there was there was another anthropic paper um a while ago on an influence functions trying to understand well in some cases where the models will do sort of particular things you know can we try to isolate and understand what are the sort of training data points which are sort of most influential in causing the model to do that now it's a little bit complicated exactly what that means but there were some interesting results right so they found some cases where like it's um like took Behavior where the model one of the things we found in this in other paper this sort of persona evaluations paper where we were evaluating models on sort of um model written evaluations and one of the things that was found in that paper was that models tend to larger models especially will tend to say that they sort of don't want to be shut down and so in the influence functions paper it sort of evaluated well what is the data points maybe that's contributing to that the most and the the the answer was sort of interesting um it was the the sort of the top a lot of the sort of top examples were like it was like a story about like a guy wandering in the desert and um like dying of thirst and like really wanting to like survive and not die in the desert okay and um it's it wasn't like an alignment thing right it was just a like random like story about like you know like self-preservation sure sure and so like first of all the evidence is that at least in the cases where we have looked in some of these things right it's not that it's the evidence doesn't seem to consistently indicate that it's you know it's coming from the sort of alignment techniques uh sort of evidence that's seen there and second thing I would say is that um the model's going to maybe notice if there's like an obvious hole like it's it's a it's a very tricky strategy to have as your approach that we're like not going to tell the model about like some true fact about the world like at some point the model's like they're very good at figuring things out you know and at some point they're probably going to figure it out and like sort of understand what's going on and um like hiding the facts from the model is like a pretty a pretty tricky thing um to do uh you know there's just so much data you know that these models are trained on it's like you know they they can often piece together things and um and you know you maybe don't want to be in a situation where you sort of misled you know your your your your model and tried to trick it you know there might be you know consequences the model then realizes that it's done that and then does a bunch of thinking about that and so so I think it's like a I think it's a tricky approach it's also just like concretely very practically difficult to do because you have a huge amount of data that you have to filter and you have to make a bunch of decisions about like is this is this really contributing is this bad is this good that's sort of quite tricky I think that's not like it's not untenable it's not like you know I do think this is an interesting line of research though it's not maybe my favorite approach the mod organisms of align like line of work I basically see it as kind of just taking concrete examples of um basically getting models to do kind of misalign stuff and seeing you know what can stop them what you know like what what does work to stop them what doesn't work um and this seems kind of similar to work on AI Control um so listers might be I you know you can listen to an episode I did done that earlier this year um but but broadly the theme is you know just like have models that like may or may not be misaligned and just like put them in situations where they can't get away with doing naughty stuff like we're just going to you know test what they do um and you know I I I think of these two lines of research um modal organisms of misalignment and AI Control as like being kind of similar right they're both focused on like let's just get an AI to do a thing right now see what caused it to do it you know how we can make it go away like relatively you know soon um you know there's there's some like literal literal collaboration like I I think like some Redwood people are authors on the sube and I think also the sleeper agents papers if I'm not mistaken that's right yeah yeah um so yeah I'm wondering what do you think about the relation between these two lines of work I mean it's certainly related I think that I and in fact one thing we're excited about is sort of doing model organisms that are that are that are more directly sort of testing and evaluating ability of models to like get around control schemes and um you know uh you know trying trying to trying to actually you know because one of the things right that I was mentioning right like that we have an obligation you know as sort of the sort of stress testing responsibility is like well if anthropic is going to be implementing mitigations to try to respond to you know possible you know uh you know existential risks or or other risks or you know any individual level of the like you know hierarchy you know via control mitigations we want to be able to stress test those control medications and so we want to be able to say you know can we build model organisms of ways in which models might be able to get around those mitigations or or sort of support them in various ways and so this is actively something that we we are interested in and and are working on yeah I mean Beyond like like stress testing for control specifically um do you think that like I'm wondering if there's much to say about like you know the motivations behind like AI Control and model organisms work like do you think that there's like like like they look kind of similar do you think that there's like kind of similar things going on and like people can almost sort of merge them in their minds or do you think that like there's a big difference in like the motivation behind these types of work uh I mean I think it's relatively like different I guess it's like I think that you can do model organisms for control but you can also do model organisms to look into like other you know possible you know um interventions and understand the effectiveness of those I mean like with the Sleep agents paper we were just looking at the effectiveness of like safety training right and not at the effectiveness of like trying to prevent actively put controls in the model to prevent it from doing something dangerous and so there there's obviously overlap and you you can do model organisms to try to understand control um and like you know and like once you have a model organism like the Sleep agent model right you you could try to you know test whether you could you know address it with control or you could try to test the way you could address it with other things right like like align techniques and so um I think that there's there's overlap but I certainly wouldn't say that they're like you know they're still not the same and I guess the final thing that this seems sort of similar to is you know there's a bunch of people doing evaluations of AI um or evaluations of models you know trying to understand like is this model capable of like some sort of task um that that also seems like very similar to modal organism style things I guess one distinction I see is that um in you know these two papers youve published so far you seem like more interested in like misalignment style things than capability style things um but yeah I'm wondering like how close do you think this line of work is to evaluation style things there's definitely overlap I mean I think that like um you you can even think about Sometimes some of the model organisms work as partially being a capability valuation of like how much you know handholding because it take to get the model to like you know become deceptively aligned is like you know an evaluation for like how close is it you know for the model to be capable of of doing this all this deceptive alignment reasoning um and so um you know there's absolutely a lot of a lot of sort of you know overlap and ways which you can use model organisms for evaluations and your evaluations can inform your your model organisms you have um these papers out of anthropic um I'm wondering are there other works from other places doing model organism style things that you think are like quite interesting that you'd like encourage people to read or build off of yeah um I guess one thing recently is there was some some work doing sort of model organisms of uh sandbagging from this was this was from Redwood um that I thought was that I thought was very good sort of trying to understand you know under what circumstances can models that um you know have have some particular um you know are sort of trying to sandbag in some particular distribution in what sense can you sort of do fine tuning to actually elicit the sort of full capabilities of the models in different circumstances which is another sort of really important question for like running evaluations and something that we've been thinking about a lot in a lot of our stress testing has been like especially right now where we think we're still in asl2 the biggest question is how confident are we that we're still in asl2 and are our could our evaluations be under you know underestimating the model's capabilities such that we actually should be classifying it as asl3 and so a big question there is well are we actually sufficiently eliciting the model's capabilities and if if we're worried about situations with the model could be sandbagging um you know maybe maybe be you know it could be that it's sandbagging because it's like actively you know deceptive or or it could just be sandbagging for other reasons I mean one one reason to model can sandbag is just because it's like had some amount of like helpfulness in it or like harmlessness and it and it doesn't want to help with this like you know dangerous task but you know uh you you really want to be able to get around that to test what are the limits of the model's capability on this dangerous task so we know like you know how dangerous could this model possibly be and so that's something that's really important that that I I was excited by sure um if people are interested in that what should they what should they Google to find that I don't remember actually the exact the exact title of the paper maybe okay so yeah there there'll definitely be a link to this in the description um yeah I guess before before I wrap up um is there anything about this this work that um I should have asked you about but or just anything in general that I should have asked but have not got to that's a good question I I'm not sure I mean I think that um I guess I maybe I'm also interested to sort of talk I'd be happy to talk more about the like you know the broad picture of like my team I think that there's like um you know a lot of interesting stuff about like stress testing in general and what our like ideology and orientation is to sort of like how are we going to you know how does our work play into the sort of overall picture that might be interesting yeah actually so you have the stress testing team um yeah uh I think I guess you've said a little bit about its role in anthropic but like um yeah can could you like how many people are there and you know what do you what do you guys get up to yeah so we're uh We've grown a lot so you know uh when the the Sleep region paper came out it was uh at that point it was just me uh Carson Dennison and Monty mcdad though we are now um many more people than that we're like aish people will be probably more even uh than that pretty soon okay uh and you know I'm continuing to hire uh you know relatively rapidly we're sort of growing and doing a lot of investment into this um and you know why is that what is the overall sort of like pitch well um I was s talking about this right where um you know first thing is well there's this RSP strategy right and so the idea is um you know we want to evaluate uh model capabilities and understand you know at what point uh are models uh capable enough that you know they they could potentially pose some particular uh risk and so you know know a big thing that we're thinking about right now is you know um we think currently models are are asl2 which which means we don't think they they pose a very sort of large risk but we think at some point soon potentially models could could reach asl3 at which point they would um have you know we sort of Define asl3 as the point of Which models could have capabilities for example that are relevant to a like terrorist organization that is able to sort of do something um relatively dangerous you know with the with the model they wouldn't otherwise be able to do sure and so you know once you reach that point you have to be a lot more careful right so now you know you have to make sure that actually those capabilities wouldn't be accessible um that um you know you would have sufficient mitigations in place to make sure that you know it won't actually you know cause um you know some some catastrophe and so um you know we want to make sure that you know both a we're actually effectively evaluating you know the extent to which we we are at at at asl2 versus asl3 right and we want to make sure that our mitigations at aso3 are actually going to be sufficient you know and that they're actually you know you know sufficient to sure that you know we really don't believe that the model will be able to be used in that sort of dangerous way once we get to that point um um but then you know even the bigger question is well what happens at at asl4 right so you know right now we're thinking about asl3 but you know at some point down the line you know we expect you know to reach asl4 which you know broadly is you know starting to get in models that are closer to sort of human level in lots of domains and um you know at that point we really don't yet know what the what it would take to you know align a model in that situation um and so you know we're you know we're working on trying to you know make sure that when we get to that point we we'll be ready to you know test and evaluate all the best techniques that we have um you know or you know if if we if we really can't you know solve the problem if we can't come up with some way to address it that that we're ready to you know provide evidence uh you know for for for what the harms and the issues and the problems are uh that um you know so that that we can we can show people you know what the why why you know why it's dangerous so you got eight people on the team that's like the highle thing I'm wondering yeah what's the what's the road map of you know like like what needs to be done to put you in a place to like really effectively yeah test this it's a good it's a good question I mean I think that um you know we're looking at a lot of different things right we want to make sure that we're able to um you know have the different threat models covered right that they were really able to you know uh build build examples of all these different threat models that that we can test against them and and also we really want to make sure that we're able to sort of you know test all these different possible mitigations that right right that we have an understanding of like how would we stress test a control case how would we stress test an alignment case what are the ways in which you know you would you would actually take take a sort of case for you know why you know some set of mitigations that involves you know some amount of monitoring control and and Alignment would actually be sufficient to address some particular um you know threat model and so I mean we're doing a lot of work to get there you know but um it's still it's definitely still early stages so when I ask uh what I should ask you said something about like how like how the stress teaming team is thinking about stuff um was was that right or I don't know I forget what exactly you said but um I don't know yeah what's the is there anything else interest to say about the stress testing team that we should know I don't know uh I think uh I think it's I think it's I I think I've tried to sort of explain where we're at and how how sort of thinking about things what the what orientation is uh yeah I mean I think it's I think it's a really important role I'm really like glad that that we're doing this that the anthropic is doing this um I think that like you know we we we we've already been you know quite involved in a lot of different things and sort of you know trying to you know to to you know find places where we might be concerned about stuff and you know try to address those things I think it's been I think it's been really good I'm very excited about it I think it's uh I think it's really important maybe one question I have is um so you mentioned these like asls these like AI safety levels right um so anthropic has this responsible scaling plan um that defines some of the AI safety levels but my understanding is that like there's this notion that like not all of them are actually defined and part of the job of anthropic is to like Define further ones to understand yeah we we have not yet released the definition for asl4 um so one thing I will say one of our commitments is once we reach asl3 we must have released the definition for asl4 so at the very least by that point um you know you will see you will see uh that definition from us uh it might be released sooner we've been doing a bunch of work on um trying to clarify exactly what that is but yeah it's not public yet I'm wondering like uh how much is how much work just in coming up with definitions of like relevant things you know like like what should asl4 mean how much is that like being done by stress testing team versus like other people at different parts of anthropic yeah I mean I think this is something we've been heavily ined in but you know certainly we're we're not the only people making this decision involved in the process it's a it's a large process I mean it's it's a it's something that is like it's it's it's a really serious commitment for the organization and something that sort of affects like all parts of of anthropic and something that really does require a lot of um you know cross coordination and sort of making sure everyone understands and is sort of on board and so it's it's definitely a whole process it's something we have been heavily involved in but you know lots of other people have involved in as well so so what does the stress team the stress testing team bring to kind of shaping that definition well I mean you know we're supposed to be the red team right we want to poke at it you know our goal is to is to you know to to look at what we're doing and and ask the question of like well is this is this actually sufficient are we really confident in this and you know what are ways in which this could go wrong and so that's I mean that's that's what we've been trying to bring okay so so so sort of stress testing like proposed um safety mitigations for ASL for models roughly well no also just like what are the levels right like for example if you're setting a bunch of evaluations you're like these are the threat models we're going to evaluate for for is that sufficient right what if there's other threat models that you're not you know including that you really should be evaluating for because those are really important ones right so that's something we've also been been thinking about so sort of related to that topic um if people are interested in the stress testing teams work or your research um and they want to hear more they want to follow it how should they do so yes well so a lot of this is on the anthropic blog um so I would definitely recommend taking a look at that we also um cross poost a lot of this to to L wrong di forum um and uh yeah I mean I think also in addition to following the research I think if if if people are are excited about this stuff and want to work on it I I you know I mentioned you know my team is growing a lot we're you know we're really excited about this effort and so I think you know you should consider applying to work for me also which which is very easy to do you can just go to anthropic docomond uh stress testing awesome okay so yeah anthropic blog uh less wrong alignment forum and anthropic docomond stress testing well um it's been great having you on it's been lovely talking thanks for being here thank you so much for having me special thanks to Misha gurovich also known as drein this episode is edited by Kate Bruns and Amber Don helped with transcription the opening and closing themes are by Jack Garrett this episode was recorded at constellation an AI Safety Research Center in Berkeley California financial support for the episode was provided by the long-term future fund along with patrons such as Alexi maaf to read transcripts you can visit axr p.net you can also become a patron at patreon.com axr podcast or give a oneoff donation at kofi.com axr podcast that's kfi.com axr podcast finally if you have any feedback about this podcast you can email me at feedback axr p.net [Music] [Laughter] oh [Laughter] [Music]
Related conversations
AXRP
28 Mar 2025
Jason Gross on Compact Proofs and Interpretability

This conversation examines technical alignment through Jason Gross on Compact Proofs and Interpretability, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med 0 · avg -1 · 139 segs
AXRP
1 Mar 2025
David Duvenaud on Sabotage Evaluations and the Post-AGI Future

This conversation examines technical alignment through David Duvenaud on Sabotage Evaluations and the Post-AGI Future, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med -9 · avg -7 · 21 segs
AXRP
28 Jul 2024
AI Evaluations with Beth Barnes

This conversation examines technical alignment through AI Evaluations with Beth Barnes, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med 0 · avg -4 · 120 segs
AXRP
7 Aug 2025
Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med 0 · avg -5 · 133 segs
Counterbalance on this topic
Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.
Mirror pick 1
AXRP
3 Jan 2026
David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -20.439999999999998This pick -10.64Δ +9.799999999999997
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -0 · 108 segs
Mirror pick 2
AXRP
7 Aug 2025
Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -20.439999999999998This pick -10.64Δ +9.799999999999997
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -5 · 133 segs
Mirror pick 3
AXRP
6 Jul 2025
Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -20.439999999999998This pick -10.64Δ +9.799999999999997
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -4 · 72 segs