Signal Room / Editorial

AXRP · Civilisational risk and strategy

Learning Human Biases with Rohin Shah

Why this matters

Auto-discovered candidate. Editorial positioning to be finalized.

Summary

Auto-discovered from AXRP. Editorial summary pending review.

Perspective map

Mixed · Society · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 53 full-transcript segments: median 0 · mean -3 · spread -230 (p10–p90 -100) · 2% risk-forward, 98% mixed, 0% opportunity-forward slices.

Slice bands
53 slices · p10–p90 -100

Mixed leaning, primarily in the Society lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes safety
  • Emphasizes AI safety
  • Full transcript scored in 53 sequential slices (median slice 0).

Editor note

Auto-ingested from daily feed check. Review for editorial curation.

ai-safety · axrp

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video RCF-KpRbHtk · stored Apr 2, 2026 · 1,601 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/learning-human-biases-with-rohin-shah.json when you have a listen-based summary.

Show full transcript
Daniel: Today we have Rohin Shah. Rohin is a graduate student here at UC Berkeley's Center for Human-Compatible AI, or CHAI. He's co-authored quite a few different papers, and he is soon to be a research scientist at DeepMind. Today we'll be talking about his paper on the feasibility of learning, rather than assuming, human biases for reward inference. This appeared at ICML 2019, and the co-authors were Noah Gundotra, Pieter Abbeel, and Anca Dragan. Welcome to the show.

Rohin: Yeah, thanks for having me, Daniel. I'm excited to be here.

Daniel: So I guess my first question is: what's the point of this paper? Why did you write it?

Rohin: This was the first piece of research that I did after joining CHAI, and at the time (I just wouldn't agree with this now) the motivation was: when we have a superintelligent AI system, it's going to look like an expected utility maximizer, so that determines everything about it except what utility function it's maximizing. It seems like the thing we want to do is give it the right utility function. A natural way to do this is inverse reinforcement learning, where you learn the utility function by looking at what humans do. But a big challenge with that is that all the existing techniques assume that humans are optimal, which is clearly false; humans are systematically biased in many ways. It also seems kind of rough to specify all of the human biases. So this paper was saying: well, what if we try to learn the biases? Just throw deep learning at the problem. Does that work? Is this a reasonable thing to do? So that's why I initially started looking into this.

Daniel: Okay, so basically the story is: we need to learn a utility function from humans, and we're going to learn it by seeing what humans do and then trying to do what they're trying to do; and in order to figure out what they're trying to do, we need to figure out how they're trying to do it. Is that a fair summary?

Rohin: Yeah.

Daniel: And specifically you're talking about learning rather than assuming human biases. Could you say more about exactly what type of thing you mean by "bias"?

Rohin: This is bias in the sense of cognitive biases; if people have read Thinking, Fast and Slow, by Tversky and Kahneman, it's that sort of thing. A canonical example might be hyperbolic time discounting, where we discount the value of things in the future more than could plausibly be said to be rational. In the sense that maybe right now I would say that I'd prefer two chocolates in 31 days to one chocolate in 30 days, but if I then wait 30 days, so it's now the day where I could get one chocolate, then I'd say: maybe I want a chocolate right now, rather than having to wait a full day for two chocolates the next day. So that's an example of the kind of bias that we study in this paper.

Daniel: And to give our listeners a sense of what's going on, could you try to summarize the paper, maybe in a minute or two?

Rohin: Sure. The key idea of how you might try to deal with these human biases, without assuming that you know what they are, is to assume that the human has not just a reward function, which we're trying to infer, but also a planning module, let's say.

Daniel: You put that in scare quotes, right?

Rohin: Yes, scare quotes exactly: "planning module". What this planning module does is take as input the environment, the world model where the human is acting, as well as what reward function the human wants to optimize, and spit out a policy for the human. So this is how you decide what you're going to do in order to achieve your goals, and it's this planning module that contains the human biases. If you think about overconfidence, maybe this planning module tends to select policies that choose actions that are not actually that likely to work, but the planning module thinks they are likely to work. So that's the key formalism, and then we try to learn this planning module, using a neural net, alongside the reward function, by just looking at human behavior (well, simulated human behavior) and inferring both the planning module and the reward function that would lead to that behavior. There is also a bunch of detail on why that's hard to do, but maybe I'll pause there.

Daniel: Sure. Well, that brings up one of my questions: isn't that literally impossible? How can you distinguish between somebody who's acting perfectly optimally with one set of preferences, one reward function you might say in the reinforcement learning paradigm, and somebody who's being perfectly suboptimal, doing exactly the worst thing, with exactly the opposite reward function? Isn't that just indistinguishable?

Rohin: Yep, that's right. And indeed this is a problem if you don't have any additional assumptions and you just take the most naive approach, where you do backpropagation, end-to-end learning, to maximize agreement with human behavior: you basically just get nonsense. You get a planning module and a reward that together produce the right behavior, but if you then try to interpret the reward as a reward function and optimize for it, it's basically arbitrary, and you get random-ish behavior. In our experiments we show that if you do just that, you get basically zero reward on average if you optimize the learned reward. These were reward functions that are pretty symmetric, so you should expect that on average you'd get zero if you optimized a random reward.

Daniel: Okay, cool.
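The chocolate example above can be sketched numerically. This is a minimal illustration, not code from the paper: the hyperbolic form 1/(1 + k·delay) is the standard textbook curve, and the rate k = 2 is an arbitrary choice made here so the reversal shows up.

```python
def hyperbolic_value(reward, delay_days, k=2.0):
    # Hyperbolic discounting: perceived value falls off as 1 / (1 + k * delay).
    # k is an illustrative discount rate, not a fitted human parameter.
    return reward / (1 + k * delay_days)

# Viewed from today, two chocolates in 31 days beat one chocolate in 30 days...
print(hyperbolic_value(2, 31) > hyperbolic_value(1, 30))  # True
# ...but 30 days later, one chocolate now beats two chocolates tomorrow.
print(hyperbolic_value(1, 0) > hyperbolic_value(2, 1))    # True
```

An exponential discounter (value proportional to gamma^delay) compares the two options by the same ratio at both dates, so its preference never reverses; the reversal is exactly what makes hyperbolic discounting count as a bias.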
Daniel: So you're using some kind of extra information, right?

Rohin: That's right. There are two versions of this that we consider. One is unrealistic in practice but serves as good intuition: suppose there's a class of environments, and you know the reward functions and the behavior for some fraction of them, say half, and for the other half you only see the behavior and not the reward functions. The idea is that if the planning module is the same across all of these, you can learn what the planning module is from the first set, where you know what the reward function is, and then use that planning module on the second set, where you're inferring the rewards. So in some sense it's a two-phase process, where you first infer what the biases are and then use those biases to infer what the rewards are. There's a second version where, instead of assuming that we have access to some reward functions, we assume that the human is close to optimal, meaning their planning module is close to the planning module that would make optimal decisions. What this basically means is that we initialize the planning module to be optimal, and then we essentially say: okay, now reproduce human behavior, and you're allowed to make some changes to the planning module so that you can better fit the human behavior. You could think of that as introducing the systematic biases, but since you started out as being optimal, you're probably not going to go all the way to being maximally suboptimal or anything like that.

Daniel: Okay. So let's talk about those. In the paper you have these assumptions, 1, 2A, and 2B, which you've talked about a little bit, but I was wondering if you could say more clearly what those assumptions are. In the paper you give natural-language explanations of them, but how does that translate into code?

Rohin: So the first assumption, assumption 1, which is needed in both of the situations I talked about, is that the planning module is "the same" (scare quotes again) for similar enough environments. In my description I assumed that you've got access to this set of environments, and the planning module works the same way across all of those environments. It's not totally clear what that means; I wouldn't be able to write down a formal meaning of it, because there's always the planning module that says: well, if the environment is this specific environment where the ball is on top of the vase or whatever, then output this policy, but if it's this other environment, then do this other thing. That's technically a single planning module that works on all of the environments. So in some sense what it's really saying is that there's a reasonable, or simple, planning module that's being used across all of the environments. And I think this sort of dependence on reasonableness or simplicity is something that we're going to have to depend on, not necessarily in this particular way, but if you don't allow for it, you get into problems well before this. For example, the problem of induction in philosophy, which is just: how do you know that the past is a good predictor of the future? How can you eliminate the hypothesis that tomorrow the evil god that has so far been completely invisible to us decides to turn off the sun?

Daniel: Okay, yeah, sorry, but what does that amount to? Does that just mean you use one neural network?

Rohin: In our code we just use a single neural network, and neural networks tend to be biased toward simplicity, so effectively that becomes something like a simplicity prior over the planning module.

Daniel: All right, I guess I sort of understand that. So that was assumption 1; how about 2A and 2B?

Rohin: Assumption 2A is the version where we say that the demonstrator is close to optimal, and we don't assume that we have any rewards. In that one, what we do is take our neural net that corresponds to this planning module and train it to produce the same things that value iteration would produce, and value iteration is an algorithm that produces optimal policies for the small environments that we consider. So by that training we're basically just training our neural net to make optimal predictions.

Daniel: And you're initializing at this optimal network, right? This training happens in an initialization phase?

Rohin: I would call this training the initialization for the subsequent phase, when we then use it with actual human behavior. This all happens before we ever look at any human behavior: we simulate some environments, we compute optimal policies for those environments with value iteration, and then we train our planning-module neural net to mimic those simulated optimal policies. All of that happens before we ever look at any human data.

Daniel: Okay. So assumption 2A essentially comes down to: when we train our networks to mimic humans, we're going to be initialized at this demonstrator that was trained on optimality?

Rohin: Correct.

Daniel: One question I have is: why initialization? It seems like kind of a strange way to use this assumption. My default might be to be kind of Bayesian and say, okay, we're going to have some sort of prior; or maybe do this regularization thing where I know what the weights of an optimal planner are and I L2-regularize toward those weights. With initialization, the strength of that quote-unquote "prior" that you're putting on the model is going to depend a lot on how long you're training, what your step size is, and such. So why did you choose to use initialization?

Rohin: Honestly, that was just the first approach that occurred to me, so I tried it and it seemed to work reasonably well. I don't think I had ever considered regularization; that seems like another reasonable thing to do, and it does seem like it would be easier to control what happens with that prior, so that does seem like a better approach, actually. I think I would lean away from the Bayesian perspective just because then you have to design a hypothesis space, and the whole point of this is to...

Daniel: I mean, regularization is secretly Bayesian anyway, right?

Rohin: Sure, fair enough. Though I wouldn't be surprised if this initialization was also secretly Bayesian, given the other hyperparameters used in the training.

Daniel: All right, so that was 2A. Then there was also assumption 2B, right?

Rohin: Yeah, assumption 2B is pretty straightforward. It just says that we have this set of tasks over which we're assuming the planning module is the same, and for half of those tasks we assume that we know what the reward function is. The way that we use this is: both the planning module and the reward function in our architecture are trained end to end by gradient descent. When we have assumption 2B, so when we have the reward functions, we set the reward functions and the human behavior and we freeze those, and we use gradient descent to just train the planning module, and this lets the planning module learn what the biases in planning are. Then, after we've done this training, the planning module is frozen.
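The two-phase recipe under assumption 2B (freeze the known rewards and fit the planner, then freeze the planner and fit the unknown rewards) can be illustrated with a deliberately tiny stand-in. This is my own toy construction, not the paper's neural-net setup: the "planning module" here is an unknown permutation of actions that the biased human applies, rather than a value iteration network trained by gradient descent.

```python
import itertools

def act(reward, bias):
    # Biased "planning module": the human believes action a yields
    # reward[bias[a]] and picks the action that looks best to them.
    return max(range(len(reward)), key=lambda a: reward[bias[a]])

true_bias = (2, 0, 1)  # hidden systematic bias we want to learn

# Phase 1 (assumption 2B): tasks whose rewards we DO know; fit the planner
# so that it reproduces the human's choices on those tasks.
known_rewards = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
demos = [(r, act(r, true_bias)) for r in known_rewards]
candidates = [p for p in itertools.permutations(range(3))
              if all(act(r, p) == a for r, a in demos)]
learned_bias = candidates[0]  # uniquely identified in this toy setup

# Phase 2: freeze the learned planner and infer the reward on a new task
# where we only observe behavior.
hidden_reward = [0.1, 0.9, 0.3]          # not shown to the learner
choice = act(hidden_reward, true_bias)   # the observed human behavior
inferred_best = learned_bias[choice]     # cell where the reward must peak
```

With the planner pinned down by the known-reward tasks, a single observed choice on the new task is enough to locate the best cell, which is the flavor of inference the paper does with gradient descent instead of exhaustive search.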
Having encoded all the biases, we then use gradient descent to learn the reward functions on the new tasks for which we don't already have the rewards. So there you're just inferring the reward functions, given your already-learned model of what the biases are.

Daniel: Okay. One comment I have on that assumption: initially it seems realistic, because sometimes people are in situations where you know roughly what they want, and then you think about it a bit more and it seems unrealistic, because you're assuming you know exactly what they want. But I think it's a little less unrealistic than that second thought suggests. For instance, one cool research design you can do in microeconomic studies is to pay people with lottery tickets. The nice feature of lottery tickets, if you assume that people want lottery tickets more than they want nothing, is that getting two lottery tickets is exactly twice as good as getting one lottery ticket, by the linearity of probabilities in expected utility. So there are some situations in which you can actually make that work. I just wanted to share that research design; I think it's quite neat.

Rohin: Wow, I love that. This is a way to just get around the fact that utility is not linear in money, right? That's cool.

Daniel: Yeah, it's excellent. Unfortunately, you only have so many lottery tickets you can give out, so you can't do it indefinitely; at some point they just have all of the lottery tickets and they've won the lottery, and you can't give them any more.

Rohin: Until then, yeah.

Daniel: All right. So I want to jump around in the paper a little bit. The question I have is: in the introduction, you spend quite a bit of time on all the strange ways in which humans can be biased and suboptimal. Reading this, I almost think this might be a good argument for modeling humans maximally entropically, using something like the Boltzmann distribution, because there you're just saying: look, I don't know what's going on, I'm going to have no assumptions; I'm just going to use the probabilistic model that uses the fewest assumptions, and in practice it does all right. So I guess I'm wondering what you think of this as an argument for Boltzmann-rational models.

Rohin: I want to note that the actual maximally entropic model is one that just uniformly at random chooses an action, which is in the Boltzmann family.

Daniel: True.

Rohin: But if you did that, you'd never be able to learn about the reward, because the human policy, just by assumption, has nothing to do with the reward. So you need... I actually mostly agree with this now; I'm not entirely sure what I would have said two and a half years ago when I was working on this, but I mostly agree with the perspective that what you need out of your model is that it assigns a decent amount of probability to all the actions, and that it ranks actions that do better as having higher probability. Those are the important parts of your model, and if you take those two constraints, the Boltzmann-rational model is a pretty reasonable model to come out with. I think, but I'm not sure, that at the very least it should hurt your sample complexity, in terms of how long it takes you to converge to the right reward function, compared to if you knew what the biases were. It also probably makes you converge to the wrong thing when people have systematic biases, because when they make systematic mistakes, you attribute those to their preferences. In some sense, the generative model that Boltzmann rationality is suggesting is that when humans make systematic mistakes, the model says: well, I guess every single time they're in this situation, when they flip their coin to decide what action to take, the coin comes up tails and they take the bad action. It's just a weird generative model.

Daniel: I mean, the actual model would be that the human prefers that action in that situation, right? That would be the actual inference that a Boltzmann-rational model would make.

Rohin: Yeah. So either it would make the wrong inference, or... what I'm saying is, you might expect that the Boltzmann-rational model would not get to the true reward, because if it were at the true reward, it would be relying on this weird generative model. So it would make the opposite inference: anything that humans systematically make a mistake on, it would infer that humans wanted to do, or at least it would have to invent some explanation by which that action was a good one.

Daniel: That's an interesting thread; I might get a little bit more into the nuts and bolts of the paper before I pick up on it more. Another question I have is: when you're doing these experiments, when you're learning models of the human demonstrator, you use value iteration networks, or VINs, right? For our audience, could you say what a VIN is?

Rohin: A value iteration network is a particular architecture for neural networks that mimics the structure of the value iteration algorithm, which is an algorithm that can be used in AI to learn optimal policies for tabular Markov decision processes. The point is, it's an example of a neural net whose architecture is good for learning a planning algorithm, which is the sort of thing that you want if you want to have a planning module.

Daniel: Okay, and it works in gridworlds, right?

Rohin: Yes. I believe people have used it elsewhere too; I want to say someone adapted it to be used on graphs, but I used it on gridworlds, and the basic design is definitely meant for gridworlds. It basically stacks a bunch of convolutional layers and then uses max-pooling layers to mimic the maximization in the value iteration algorithm.

Daniel: Sure. So one question that I have is: my vague memory of value iteration networks is that they can express literally optimal value iteration, right?

Rohin: That's correct.

Daniel: If that's true, why bother doing gradient descent on optimal behavior to get a model of an optimal agent, rather than just setting the weights to be the optimal thing?

Rohin: There was a reason for this; I'm struggling a bit to remember it. Partly it's that actually setting the weights takes some amount of time and effort; there's no trivial way to do it, and it depends a bit on your transition function and on your reward function. Also, a VIN is only able to express literal optimal value iteration when the horizon is long enough, which I think it was not in my case. And I believe you might need multiple convolutional layers in order to represent the transition function and the reward function, but I am not sure; it's possible those two things are the same issue, because of the horizon. I did at one point actually just sit down and figure out what the weights were to encode optimal value iteration, because I was very confused about why my value iteration network was not learning very well, and then I found out that I couldn't do it with my current architecture, but if I added an additional convolutional layer to the part that represents the reward, then I could.
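For reference, plain tabular value iteration, the algorithm whose structure a VIN mimics with stacked convolution and max-pooling layers, looks like this on a made-up four-state chain MDP. The environment, rewards, and discount factor are illustrative assumptions, not the paper's gridworlds.

```python
import numpy as np

n_states, gamma = 4, 0.9
reward = np.array([0.0, 0.0, 0.0, 1.0])  # reward received on entering a state
# Deterministic transitions: action 0 = move left, action 1 = move right
# (clamped at the ends of the chain).
next_state = np.array([[max(s - 1, 0), min(s + 1, n_states - 1)]
                       for s in range(n_states)])

V = np.zeros(n_states)
for _ in range(50):  # repeated Bellman backups until (approximate) convergence
    Q = reward[next_state] + gamma * V[next_state]  # shape (n_states, 2)
    V = Q.max(axis=1)                               # the maximization step
policy = Q.argmax(axis=1)  # optimal policy: always move right, toward state 3
```

In a VIN, the Bellman backup `reward + gamma * V` is computed by convolutional layers and the `max` over actions by pooling, so gradient descent can shape a learned, possibly biased, version of this loop.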
So I added that, and then even the learned version started working much better. It was great.

Daniel: I mean, if you knew that adding this could let you express optimality, why... I don't know, I guess I have a vendetta against people learning things. Er, against people training models that learn things; I'm in favor of human learning, to clarify.

Rohin: There was a reason. The version that I wrote computed the optimal policy, but it did not compute the optimal explanation for human behavior: A, it wasn't Boltzmann-rational, and B, its computation of the Q-values was kind of sketchy. I don't think it was actually the right Q-values, and I don't remember why; it might just be that it didn't work for Boltzmann rationality. You'd get cases where, if "up" was the correct action, "up" would have a value of, say, 3.3, and then "left" would have a value of, say, 3.25, and so on. So you'd get the right optimal policy, which was the main thing I was checking, but you wouldn't actually do very well according to the loss function I was using.

Daniel: Okay. So I guess the other half of this question is: you had this list of biases at the start. Are value iteration networks able to express these biases, and in general, what kinds of biases are they able to express?

Rohin: That's a good question. In some sense the answer is: arbitrary ones; they're neural nets.

Daniel: I mean, your value iteration networks had a finite width and a finite depth, right?

Rohin: Yeah, my value iteration networks... hard to say. I think they could definitely express the underconfidence and overconfidence ones. Well, no, maybe not even that: you have to compute the amount by which you should be underconfident or overconfident, and I'm not sure there were enough layers to do that exactly. I think in general the answer is that the networks I was using could not in fact literally, exactly compute the biases I was using, but they would get very, very close, in the same way that neural nets, depending on your architecture, can't exactly multiply two input numbers but can get arbitrarily close.

Daniel: Okay. So this leads to another question about the setup of your paper. You encode biases by having your planner be the slightly wrong value iteration network, at least at the time of doing inference, but interestingly, you assume that the world model is accurate. When I think of bias, either cognitive bias or the kind of thing that I might read in Thinking, Fast and Slow, a lot of it is about making wrong inferences or having a bad idea of what's going on in the world. So I was wondering why you chose to have bias specifically in the planning phase, and to focus on learning that.

Rohin: I think two answers. The first answer is that this was the one that I had some idea of how to deal with.

Daniel: All right, fair.

Rohin: The second answer is that the cases where you have a bad model of the world can also be viewed as: your planning module transforms the true model into a bad model, and then does planning with that bad model. This is actually kind of what value iteration networks do: the weights of the value iteration network encode what the transition function is.

Daniel: Huh. I mean, if I were really interested in understanding humans... I don't know, it seems the typical human bias is not well modeled by "somewhere in your brain there is the exact model of how everything in the world works", right?

Rohin: That seems right.

Daniel: I mean, point taken about having to focus on something, though.

Rohin: Yeah. I think in this case it was actually that the transition dynamics were the same across all the environments, and the value iteration network was allowed to learn a warped version of them. Which is not the same thing as "when humans look at the world, they misunderstand what the transitions are"; it's more like: when we came up with our model of what the human planner was doing, we put into it this incorrect model of how the world works. So that is still a difference, but it isn't that we learned a planner that gets the correct transition dynamics and then warps them. I know I said that earlier; I more meant that the optimization was doing that, or possibly I just said a wrong thing earlier; I'm not entirely sure which.

Daniel: Okay, cool. So next, our listeners are probably wondering what happened. You did some experiments and got some results; could you briefly describe what experiments you ran and, overall, what the results were?

Rohin: Yeah, it was pretty simple: we simulated a bunch of biases in gridworlds. Let's see, I'll just look at the paper. There was a naive and a sophisticated version of hyperbolic time discounting; a version where the human was overconfident about the likelihood of their actions succeeding; another where they were underconfident about the likelihood of their actions succeeding; and one where the human was myopic, so wasn't planning far out into the future. So we had all of these biases; we would then generate a bunch of environments and simulate human behavior on those environments. This created a dataset of environments in which we had human behavior, and we also had the ground-truth rewards that were used to create that behavior, so we had a metric to compare against. Then we would take this dataset and, depending on whether we were using assumption 2A or 2B, we'd either remove all of the reward functions or only half of the reward functions, give it to the algorithm, and it had to predict the reward functions that we didn't give it. It was then evaluated on how well it could predict those reward functions; in particular, if we then optimized the inferred reward functions, how much of the true reward would that policy obtain.

Daniel: Okay, so importantly, you're measuring the learned planner and learned reward as a pair, rather than just the learned reward function?

Rohin: No, sorry: we take the learned reward function and we optimize it with a perfect planner, not the learned one.

Daniel: Oh, okay. So you're evaluating the reward function by how much reward you would get if you planned perfectly with it, which is a pretty stringent standard, right? Because... well, actually, in a tabular setting, if your reward function is a little bit off, the optimal policy only gets a little bit less reward, but in general this can be a little tricky.

Rohin: It's kind of stringent and kind of not stringent. In the environments we were looking at, it mostly mattered whether you got the highest-reward gridworld entry correct, because that was the main thing that determined the optimal policy. It was not the only thing, but it was the main thing, so you mostly needed to get that correct. But it also mattered to know where the other rewards were, because if you can easily pick up a reward on the way to the best one, that's often a good thing to do. And sometimes (I forget if this was actually true in the final experiments that we ran) if the reward is far enough away, you just want to stay at a maybe slightly smaller but closer reward and take that instead. But for the most part, you mostly just need to predict where the highest reward is in the gridworld.
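The evaluation metric described here can be sketched as follows. This is a paraphrase under a simplifying assumption of my own: in this toy world the optimal policy is just "go to the best cell and stay there", so "optimize with a perfect planner" reduces to an argmax, and all the numbers are invented for illustration.

```python
def greedy_target(reward):
    # Stand-in for "optimize with a perfect planner" in this toy setting:
    # the planned policy simply heads to the best-looking cell.
    return reward.index(max(reward))

def true_return(learned_reward, true_reward):
    # Score the policy planned under the LEARNED reward by the TRUE reward
    # it actually collects, which is the paper's evaluation criterion.
    return true_reward[greedy_target(learned_reward)]

true_reward = [0.0, 0.2, 0.0, 1.0]

good_estimate = [0.05, 0.15, 0.0, 0.9]  # argmax cell correct: full value
bad_estimate = [0.3, 0.1, 0.0, 0.2]     # argmax cell wrong: collects nothing
```

Note how `good_estimate` is numerically off everywhere yet scores perfectly because its argmax cell is right, matching the point that the metric mostly tests whether you locate the highest-reward entry.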
Daniel: Okay. And what kind of results did you get?

Rohin: If you take assumption 1 but not assumption 2A or 2B, the ones that let you get around the impossibility result, so you basically just don't run the initialization where you make the neural net approximately optimal, then you get basically zero reward on average. So that was the first result: yep, the impossibility result really does affect things; you do have to make some assumptions to deal with it. Then, for the versions that actually did use the assumptions, we found that it helped relative to assuming either a Boltzmann-rational model of the human or a perfectly optimal model of the human. But it only helped a little bit, and this was only if you controlled for using a differentiable planner, sorry, for using this planning module, because it introduces a bunch of approximation error. So when I say we compared to having a Boltzmann-rational human model, I don't mean that we used an actual Boltzmann-rational model; I mean we simulated a bunch of data from a Boltzmann-rational model, trained the planning module off of that data, and then used that trained planning module to infer rewards. This was basically to say: well, differentiable planning modules are not very good, and we want that to be consistent across all of our comparisons. But that's not going to hobble it relative to the others, whereas if you were just going to assume a Boltzmann-rational model directly, you wouldn't need to do this differentiable-planner thing, and so you would do better.

Daniel: Okay, so basically: assuming either 2A or 2B did help you, but only a little bit, compared to assuming neither but using a differentiable planner; and using a differentiable planner is itself quite a bit of loss?

Rohin: That's right.

Daniel: Okay. And do you have information about how 2A and 2B compared, if you only wanted to use one of them?

Rohin: You should expect that 2A does... I mean, okay, I expected that 2A would do... sorry, I forget which is which. I expected that 2B would do quite a bit better than 2A.

Daniel: For our listeners, who might have forgotten what 2A and 2B are...

Rohin: Yeah, I was about to say. I thought that it would help significantly to know what the reward functions were. So, knowing half the reward functions, which lets you ground your learning of the biases.

Daniel: You thought that would help more than training your learned model starting from optimality?

Rohin: Yeah, because in some sense there's now some sort of ground truth about the biases: there's a good learning signal, there's no impossibility result that you're trying to navigate around. It seems like a much better situation. And it did do a bit better, but only a little bit better; I was surprised at how little.

Daniel: So 2B was slightly better at recovering the rewards, compared to assuming the demonstrator was close to optimal, but not actually very much better, is what you're saying?

Rohin: Yep, that's right. And really, it was a lot better in two of the conditions that we checked, a little bit worse in a couple of the conditions we checked, and on average it washed out to being a little bit better.

Daniel: Okay, all right, that's interesting. So one thing that's interesting to me about your results section: if I read section 5.2, it comes off as more scientific than most machine learning papers, I want to say, in the sense that it seems to be interested in carefully testing hypotheses, and in the paper you have these headings of
"manipulated variables" and "dependent measures" and various comparisons. So I'm wondering: firstly, why did you adopt this slightly non-standard approach, and, related to this, what do you think of the scientific rigor of mainstream machine learning?

Rohin Shah: Oh man, controversial questions. Start with the easy one: why do I adopt that approach. The actually correct answer is that I am advised by Anca (Anca Dragan is a professor at UC Berkeley and one of my advisers), and of my advisers, she had the most input into this paper. She's very big on doing experiments more in the style of normal scientific experiments, as opposed to typical ML experiments. So the first answer is "because my adviser is Anca". But I do agree with Anca about this. There definitely is a point to the normal ML way of doing experiments, where (this is an oversimplification) the point is basically to show that you do better than whatever previously happened. That structure does lead to significant progress on any metric that is deemed to be something you can run this sort of experiment on, so it does tend to incentivize a lot of progress in cases where we can crystallize a nice metric. I am less keen on those sorts of experiments, though, because I don't see the main problems in AI research as "we have these metrics and we need to get higher numbers on them". All of the things in AI that are interesting to me, even if we set aside AI safety, look more like "oh my god, what's going on with deep learning? It's got all these crazy empirical facts that I wouldn't have predicted a priori. What's going on there? Can we try to understand it?" Those are the vein of questions I'm interested in, and for those sorts of questions, you run experiments to learn new information. If you already know how the experiment is going to come out, it's kind of a pointless experiment to run; the point of running that experiment is to successfully publish a paper. I've done my share of that, but those aren't the experiments I'm usually excited about.

Daniel Filan: All right. So, now that we're getting more philosophical: my understanding is that you think of yourself as an AI alignment researcher, or an AI safety researcher, or something. Is that right?

Rohin Shah: Yes.

Daniel Filan: So how do you see this paper as fitting into some path to create safer, aligned AI?

Rohin Shah: There's the path that I mentioned back at the beginning of this podcast: the idea that we were going to learn some utility function, and whenever we had a task we wanted an AI to do, we would have a human do that task and then have the... or... what does that even look like?

Daniel Filan: Yeah, could you say more?

Rohin Shah: I'm not particularly optimistic about this path myself. Relying just on learning a reward function from human behavior that can then be perfectly optimized: I'm fairly confident that will not work. But it seems likely that there are plans that involve learning what humans want, and having better methods to do that seems valuable. Whether it has to be specifically about systematic biases, and whether they can be learned, I feel is less important at this point, but you do need to account for human biases at some point. To outline maybe more of a full plan: you could imagine that we build AI systems and train them to be essentially helpful, to be like good personal assistants with superhuman capabilities at various things, but still thinking the way personal assistants might, in the sense that they're not sure what your preferences are, what you want to happen, and so they need to clarify that with you, and so on.

Daniel Filan: This feels like one of the subtasks that such an agent would have to do: inferring what you want.

Rohin Shah: Yes, exactly.

Daniel Filan: I guess, if I think about machines that I employ to do things for me: if I want video conferencing software, or even if I were to get an employee (not that I've done that), usually the way I get good video conferencing software is not that I first demonstrate the task of relaying video and audio from one place to another very quickly. I just can't do that; I can't even do an approximation of that. And similarly with employees, there's probably a little bit of instruction by demonstration, but I don't think that's the main way we communicate tasks to people, right? Am I wrong about this?

Rohin Shah: No, that seems right. I think you're conflating the evaluation that we did with the technique. The technique is trying to infer what the reward is. A better analogy would be... since it's sort of assuming optimal demonstrations, it's more like assuming you have a magic camera that gets to watch the person as they go about their day-to-day life. (I guess they're also not aware of this camera, which is not a great assumption, but let's assume it for now.) So you watch the human go about their life, and you say, "OK, based on the fact that they had cake today, I can infer that they like cake", or something. But maybe then you say, "oh no, actually humans have this short-term bias, so I shouldn't infer that they don't care about their lifespan or their overall level of health; it could just be that, even though reflectively they would endorse being healthy and having a long lifespan, in the moment they went with a short-term preference for nice sweet cake".

Daniel Filan: OK. This always comes up in these discussions, and I want to defend eating cake. I feel like you can care about your lifespan and also eat some cake.

Rohin Shah: That's totally true. I'm just saying you don't want to over-update on it.

Daniel Filan: Sure, but presumably your spy camera is also going to see me put on a seat belt, right? There's already a bunch of information about me caring about my life, and eating a bit of cake is not strong evidence that I don't care at all about my future lifespan, even if you naively assume that people act optimally. I just feel like all these examples are super biased against cake. We wouldn't want people to think that eating cake is ever a good idea according to human values, you know?

Rohin Shah: Fair. OK, I'll stay away from the cake. But I wouldn't say that if you assume perfect optimality in that situation, you learn something like "humans care about having a long lifespan". You learn something more like "humans don't want to be in violent accidents, but they don't mind dying of whatever it is that cake causes". You can always inject more and more state variables to distinguish behaviors, in order to explain why humans seem not to care about their life in one case but do care about it in the other. And that's also a thing you don't want your systems to do.

Daniel Filan: So, going up a level: is the idea that the way AGI is going to work is that the product sold by AGI Corp, the corporation that sells AGI, is going to be: you'll wear something like a GoPro on your head for a while, and the system is going to learn roughly what you value in life, and it's just going to generically do things to make your life more like you want it to be? Is that it? Plausibly that seems a little bit scary; I don't know if I want that thing, right?

Rohin Shah: Sorry, say that again?

Daniel Filan: I guess it seems to me that if I want a superintelligent system, I'd rather first have a superintelligent system that did a well-defined task, rather than one that generically makes my life better. Or at least, when I think about AI alignment, it seems like we would be able to figure out how to create an aligned, safe AI that does one task before we figured out how to create an aligned AI that generically makes all of your life better. And I'm kind of on board with the arguments that we don't know how to make an aligned AI that does even one concrete task, since there are still problems of vagueness in specifying what the task is, if the task is something like "build a thriving city". But I find it weird that so much of the field is about "generically make your life good".

Rohin Shah: I feel like there's not that much of a difference between these. "Build a good city" seems pretty similar to "build an agent that generically helps me, and it turns out that in this particular case what I really want is a city". You would still want these personal assistants to defer to you, and to obey explicit instructions if you give them; I think you do want to shoot for that. So it doesn't feel like those are that different in terms of the actual things they would do. In terms of research strategies for how to accomplish this: I guess in the versions where we're trying to build AGIs that can do these broad, vague tasks while saying "OK, we need to first figure out how to make an AI that does this one task", I just don't see what benefits this gives, how this makes the problem easier than just "build an AGI that's trying to help me".

Daniel Filan: So, one benefit is... well, let's go with the city task first. The thing with the city task is: I think the way to do it is not "I'm going to watch a human try to do urban planning for a month or something, and then I, the AI, am going to take over". That's not what you're suggesting, right? Actually, I can't tell.

Rohin Shah: No, I think for the city-planning case in particular, this algorithm might infer some common-sense details, but it's not going to infer what a good city looks like, because you just don't get information about that by looking at a single human. So you need something else. I'm more like: let's leave this paper aside. I'm not at all convinced that it will matter at all for AI alignment; I would not bet on that. But the general idea of "the AI system is going to try to help us, and it will be inferring our preferences as part of that" is something I'm more willing to stand behind. I think in this case it looks more like: when you tell the system, "hey, please design me this city", it goes and reads a bunch of books about how to design cities, if such books exist. It looks at what previous urban planners have done, it maybe surveys the people who are going to live in the city to figure out what they would like, and it periodically checks in with you and says, "this is what I'm planning to do with the city, does that seem good to you?"
And if I then say, "this is the reason that's bad, this doesn't seem good to me for reason X", it can say, "well, I chose that for reason Y, because I thought you would prefer Z, but if you don't, I can switch it to this other thing". OK, maybe I'm rambling a bit here.

Daniel Filan: So I guess you're imagining that we're going to have these AI systems, they're going to do tasks that are kind of vague, but the way they're going to do that is by inferring human preferences, basically in order to infer what we mean when we say "please build a good city", and they're going to do that from a whole bunch of sources of information. And maybe one fifth of what they're doing is looking at people who are trying to do the task and inferring what the task was, assuming they were doing a good job of trying to do it.

Rohin Shah: I don't know about one fifth...

Daniel Filan: Well, you said a lot of things, and very few of them sounded like inverse reinforcement learning to me.

Rohin Shah: Yeah, sorry, I was imagining much less than one fifth, to be clear.

Daniel Filan: OK, sure. A small amount I'm happy with.

Rohin Shah: Yeah, less than or equal to one fifth, sure.

Daniel Filan: OK. And I guess this gets a bit into other work of yours that you've collaborated on, which we won't be talking about right now. I can kind of understand this, but to answer a question you asked me a little bit earlier: the reason that I want to do "plan a city" rather than "generically make my life better" is, firstly, that if indeed we are trying to create an AGI that can plan a city rather than generically make your life better, then if somebody says that out loud, we can check whether our research helps with that. So that's one reason to be clear about it. The reason to prefer planning a city to generically making my life better is, secondly, that intuitively it seems like an easier job: if you're planning a city, the way you're generically making my life better is by planning a city; if you're generically making my life better, you're making my life better in every single way, including, presumably, occasionally planning a city. So it must be strictly easier to do the first thing. And then third, I prefer the plan-a-city type task because at the end you can check whether a city got planned, and roughly evaluate the city and see if it's a good city. That seems like a target where you're going to know whether you hit it more easily than you're going to know whether this AI system generically made your life better. I'm wondering what you think of those points.

Rohin Shah: I think I didn't understand the first point, but maybe I'll talk about the second and third first, and then we can come back to it.

Daniel Filan: Well, maybe the first point wasn't responding to anything, but I think it is just: we should try to be clear about what problem our research is trying to solve. And the second and third points are arguments that planning a city and generically making your life better are importantly different problems, and that one of them is better than the other.

Rohin Shah: The first point seems good. Being clear about what we're trying to do seems good.

Daniel Filan: It's so rare in papers, right?

Rohin Shah: It is annoyingly difficult to get papers that say exactly what we're trying to do to be published. Very sad. I tried; we'll see whether or not it works. OK, but for the second point: I agree that the planning-a-city type task should be strictly easier, if you're imagining that your "help me generically" AI system could also be asked to plan a city. I think that's more a statement about capabilities, though. I sort of see the good properties we're aiming for in AI safety and alignment as coming from the "trying to help you" part, and we can have different levels of competence at helping. Maybe initially we just have agents that are trying to help us schedule meetings on a calendar, and not doing anything beyond that because they're not competent at it. As we get to more general agents, we'll need to ensure that these agents know what they can and cannot do, so that they don't try to help us by doing something they're incompetent at, where they just ruin things without realizing it. That's one additional challenge you have here. But I sort of see it as: most of the safety comes from the agent being helpful in the first place, and that's the reason I'm aiming for that instead of things like "plan a city". Remind me what the third point was?

Daniel Filan: The third point was that you can check whether you've succeeded at planning a city more easily than you can check whether your whole life has generically been made a bit better.

Rohin Shah: I guess I don't see why that's true. It seems like I totally could evaluate whether my life is better as a result of having this AI system. Maybe the AI system tricks me into thinking my life is better when it's not, but the same thing can happen with the city.

Daniel Filan: To me it seems like building a city is a better-defined task. Maybe this is wrong; it just feels like there are so many more ways in which my life could plausibly be better, but at least with the city, it feels like I can check whether it's good or not.

Rohin Shah: I think it depends a little bit on how you're going to quantify the helpful part. Maybe, as one metric: did the AI system follow the instructions I gave, was it competent at those instructions, did it infer something that I wanted without me having to say it, or something like that. I would guess that at least all the people who hire personal assistants are in fact able to tell whether those personal assistants help them or not, and hopefully they can distinguish between bad and good personal assistants.

Daniel Filan: I think that's because they give the personal assistants specific tasks, like "please do my taxes".

Rohin Shah: We plausibly will do that too. But even real personal assistants can, I expect, often take a lot of initiative. One example from me personally: I write the alignment newsletter, and for quite a while now it's been published by somebody else, specifically Georg from the Future of Humanity Institute; Sawyer from BERI helps run it as well. At some point I said, "you know, we should probably switch from this pretty plain template that Mailchimp has to something that has a nicer design". I mostly just said this, and then Sawyer and Georg just sort of did it, periodically sending a message like "this is the plan", and I'd give a thumbs up. At the end of it there was a design, and it was great. In some sense I did specify a task, which was "let's have a pretty design", but it really felt like a fairly vague sort of instruction that they then took and executed well on, and I sort of expect it to be similar for AI.

Daniel Filan: I think that's fair. I guess the other thing I wanted to pick up on: in your answer, it sounds like you think there's this core of being helpful, that there's some technique to being helpful, and you can be better or worse at it, and once you know how to be helpful, you can be helpful at essentially anything. That's my interpretation of what you said. I'm wondering to what extent you think this is right, and whether you see the important part of the AI alignment or safety community as trying to figure out, computationally, what it means to be helpful.

Rohin Shah: That seems broadly correct to me. I wouldn't say it's the entire alignment community's thing: I think there's a subset that cares about this, and a different subset does other things.

Daniel Filan: And my question was whether you think that's what they should be doing, whether that's what the problem is.

Rohin Shah: There's some meta-level, outside-view argument that we should be encouraging diversity of thought or whatever, but if you're asking what my personal thing is that seems most promising, such that I'd want to see the most resources devoted to it: I think it's right to say that it would be AI that is trying to help you, or "trying to do what you want", as I think Paul Christiano would phrase it. It sure seems like there are a lot of subproblems of that, but it does feel like it has a domain-independent core, or something. If you look at assistance games, which were previously called cooperative inverse reinforcement learning games (or maybe just cooperative inverse reinforcement learning), I feel like that is a nice, crisp formalization of what it means to be helpful. It's still making some simplifying assumptions that are not actually true, but it really does seem to incentivize quite a lot of things that I would characterize as helpful skills. It incentivizes preference learning; it incentivizes asking questions when unsure, and asking them only when they become relevant, rather than asking about every possible situation that could ever come up at the beginning of time; and it incentivizes learning from human behavior, passively observing, the sort of thing we were talking about before. So it feels like this is a thing we can in fact get agents to do in a relatively domain-independent way, and if we succeeded at it, then there would not be existential risk anymore.

Daniel Filan: Well, on that note, I think we've had a good conversation. Hopefully our listeners understand the paper a little bit better, but of course I would recommend reading it. The name of the paper is "On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference". Today's guest has been Rohin Shah. Rohin, if viewers wanted to follow your work, what should they do?

Rohin Shah: The most obvious thing to do is to sign up for the alignment newsletter. This is a newsletter I write every week that summarizes recent work in AI alignment, including my own, so that's a good place to start; it's also available in podcast form. I also write some stuff on the Alignment Forum, so you could go to alignmentforum.org and search for my username, Rohin Shah. And the last thing would be the papers I've written, which are all available, or linked, on my website, rohinshah.com. You can also find a link to sign up for the alignment newsletter there.

Daniel Filan: All right, thanks for today's interview, Rohin.

Rohin Shah: Yeah, thanks for having me.
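The evaluation protocol described early in the conversation (plan optimally with the inferred reward, then measure how much true reward that policy obtains) can be sketched in a few lines of tabular value iteration. This is an illustrative toy reconstruction under stated assumptions (a tiny deterministic chain MDP, state-based rewards, infinite-horizon discounting), not the paper's actual code; the function names and the example environment are our own.

```python
import numpy as np

def value_iteration(R, P, gamma=0.95, iters=500):
    """Optimal Q-values for state rewards R (shape (S,)) and transitions P (shape (A, S, S))."""
    Q = np.zeros((P.shape[0], R.shape[0]))
    for _ in range(iters):
        # Bellman optimality backup: Q(a, s) = R(s) + gamma * sum_s' P(a, s, s') * max_a' Q(a', s')
        Q = R[None, :] + gamma * (P @ Q.max(axis=0))
    return Q

def policy_return(policy, R, P, gamma=0.95, iters=500):
    """Discounted return of a deterministic policy (shape (S,)) evaluated under reward R."""
    S = R.shape[0]
    P_pi = P[policy, np.arange(S), :]  # (S, S) transition matrix induced by the policy
    V = np.zeros(S)
    for _ in range(iters):
        V = R + gamma * (P_pi @ V)
    return V

def true_reward_obtained(R_inferred, R_true, P, start=0, gamma=0.95):
    """Plan optimally with the *inferred* reward, then score that policy under the *true* reward."""
    pi = value_iteration(R_inferred, P, gamma).argmax(axis=0)
    return policy_return(pi, R_true, P, gamma)[start]
```

On a three-state chain where the true reward sits at the far end, an inferred reward that places the highest-reward cell correctly recovers the optimal true return, while one that misplaces it earns nothing, matching the observation above that what mostly matters is predicting where the highest reward is.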

Related conversations

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

AXRP

11 Apr 2024

AI Control with Buck Shlegeris and Ryan Greenblatt

This conversation examines technical alignment through AI Control with Buck Shlegeris and Ryan Greenblatt, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Future of Life Institute Podcast

7 Jan 2026

How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann)

This conversation examines core safety through How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann), surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
