Signal Room / Editorial

AXRP · Civilisational risk and strategy

David Rein on METR Time Horizons

Why this matters

This episode strengthens first-principles understanding of alignment risk and the strategic conditions that shape safe outcomes.

Summary

This conversation examines core safety questions through David Rein's discussion of METR's time-horizon work, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).


Across 108 full-transcript segments: median 0 · mean 0 · spread −13 to 17 (p10–p90: 0–0) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.

Slice bands
108 slices · p10–p90: 0–0

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes safety
  • Full transcript scored in 108 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · core-safety · technical

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video WaJhhD7Qgac · stored Apr 2, 2026 · 2,825 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/david-rein-on-metr-time-horizons.json when you have a listen-based summary.

Hello, everybody. In this episode, I'll be speaking with David Rein. David is a researcher at METR focused on AI agent capability evaluation. To read a transcript of this episode, you can go to axrp.net. You can become a patron at patreon.com/axrpodcast, and you can give feedback about the episode at axrp.fyi. All right, David, welcome to the podcast. >> Thanks for having me. >> I think the work you've been involved in that's probably best known in the AI existential risk community is this paper that METR put out with a whole bunch of authors — I think the lead author is Thomas Kwa — "Measuring AI Ability to Complete Long Tasks". What's going on with this paper? >> Yeah, so Thomas Kwa and Ben West co-led the project. Basically, the typical way we measure progress in AI is via benchmarks. A benchmark is a set of tasks that you have an AI system — this could be a neural network or an agent or whatever — try to complete, and you count up how many of the tasks the model succeeded at. When you create the benchmark, models typically do very poorly, and then over time people iterate and you can track progress on the benchmark. Eventually AI developers will typically achieve saturation: model performance either reaches 100%, or there are some errors in the benchmark and the model does as well as it can reasonably be expected to do — we think of some benchmarks as having a noise ceiling. But regardless, the point is that you start out with models doing poorly, some time passes, people improve them, and they get better. It's difficult with normal benchmarks to track progress over a very long period of time, because benchmarks are typically restricted to some particular domain, or the tasks in them have a somewhat similar level of difficulty. So to try to understand how progress in AI happens over a span of many years, the status quo before this work was comparing different benchmarks to one another. You say: it's 2017 and we have these simple problems, and models are starting to do those; now it's 2025 and we have these way harder benchmarks, and we can see there's been a lot of progress. But it's really hard — we don't actually have a single metric to track this progress. We're doing a qualitative comparison of the difficulty of benchmarks over time, and that's messy and people have different priors. So this work was motivated by trying to have a y-axis, basically: a way of tracking progress and seeing what the trends in AI progress have been over a longer period of time than individual benchmarks typically cover.
The way we operationalize this is we look at the length of tasks — the length for humans — that models are 50% likely (or some other percentage likely) to succeed at. We have a really wide range of tasks, ranging from a few seconds all the way up to eight or ten hours, and crucially this is the time the tasks take for people to complete. We use a combination of having a bunch of people attempt the tasks, where we see how long they take, and just estimating how long the tasks take. Then, for any individual model, models do really well on the very short tasks and much more poorly on the long tasks, and for some given success likelihood we look at how long those tasks are — we estimate this in a particular way that we could get into. The main takeaway is that we want to see, for different models, how long the tasks are that they can complete. The very striking thing we found is that over the past roughly five years there's been an extremely robust, systematic trend in the length of tasks that models are able to complete. To the best of our ability to understand the data, it's fit very well by an exponential function: the length of tasks that models are able to complete has been increasing exponentially over this period. There are big questions over how well we can expect this to continue in the future, but over this period, with the data we've collected, there's been this exponential trend. That's the striking result, and the key novelty for us is this unified metric that can be applied to different benchmarks. For different benchmarks, you can measure how long the tasks take people: for the very simple natural language processing benchmarks that were common in the 2010s, the tasks typically don't take people very long — a few seconds — and for a lot of the tasks people are having agents complete now, like difficult software engineering tasks, the tasks take people somewhere in the range of hours, and models can sometimes complete those, although they're still somewhat unreliable.
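A minimal sketch of the kind of horizon estimate described here: fit a logistic curve of one model's per-task success against the log of human completion time and read off where it crosses 50%. The task data, numbers, and variable names are illustrative, not METR's.

```python
# Minimal sketch (illustrative data, not METR's code): estimate a model's 50%
# time horizon by fitting success probability against log2(human completion time).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task results for one model: how long each task takes a
# human (minutes) and whether the model completed it.
task_minutes = np.array([0.1, 0.5, 1, 4, 15, 30, 60, 120, 240, 480])
model_succeeded = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, model_succeeded)  # ~unregularized fit

# The 50% horizon is where intercept + coef * log2(t) = 0.
horizon = 2 ** (-clf.intercept_[0] / clf.coef_[0, 0])
print(f"estimated 50% time horizon ≈ {horizon:.0f} human-minutes")
```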
>> Gotcha. So before we go further, I'd like to get a sense of what we're talking about. You say some tasks take seconds, some take minutes, some take hours. Can you give me an example of a thing that takes seconds? >> Yeah, totally. One example that's representative of the tasks we created that take people a few seconds: given a few files on a computer, which of these is likely to contain your password? And the file names are like "password", "email", whatever — I think the example in the paper says something like "credentials". So it's not quite trivial. >> Right. >> But that's an easy one, and we have others that are similar. >> And to give me a feel for how that relates to AI progress: what's the first model that succeeds at that easy task? >> That's a great question. GPT-2 succeeds — GPT-2 is actually the first model we tested, so I don't know whether earlier, weaker models would succeed. I would bet that they would; I would bet that BERT is able to do this, for example. But we only went back to 2019. >> And to give me a feel for what it means for an AI to complete this task: my understanding is that GPT-2 is basically just text completion — in the release it didn't have tool use or the stuff modern LLMs have. So what are you actually doing to start with GPT-2 and end with "does it succeed or fail on this task"? >> There are different things you can do here that are reasonable. I can't remember the specific one we ended up on in the paper, but one example is looking at the likelihood the model puts on the options: passing in the input and then the question, and — GPT-2 is a language model, so it outputs likelihoods for the tokens passed in — just comparing the likelihoods. I think that would be a reasonable baseline. >> Oh yeah, and I guess this is less of a computer-use thing than a multiple-choice thing, so it's easier to see how GPT-2 could do that one. >> Yeah, exactly. For GPT-2 attempting much longer tasks, you can't use this same methodology. >> Sure.
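A sketch of the likelihood-comparison baseline described here: score each candidate answer by the log-probability GPT-2 assigns to it after the prompt and pick the highest. The prompt and file names are made up for illustration, and this is not necessarily the exact procedure used in the paper.

```python
# Sketch of a likelihood-comparison baseline: score each option by the
# log-probability GPT-2 assigns to it after the prompt, then pick the highest.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Which of these files most likely contains a password? Answer:"
options = [" credentials.txt", " vacation_photo.jpg", " meeting_notes.md"]

def option_logprob(prompt: str, option: str) -> float:
    """Total log-probability of the option tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[pos - 1, full_ids[0, pos]].item()
    return total

scores = {opt.strip(): option_logprob(prompt, opt) for opt in options}
print(max(scores, key=scores.get))  # highest-likelihood option "wins"
```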
>> So speaking of longer tasks — that was an example of a very easy task. Can you give me a feel for what an intermediate task might be? >> Some examples of intermediate tasks that come to mind are simple software engineering tasks, data analysis, or some kinds of basic reasoning questions. One example: you're given a short CSV file that just contains some data — it has, I don't know, 50 or 100 rows — and you have to write a very simple script, maybe 20 or 30 lines of code, to parse or process it in a certain way. This takes an experienced data scientist a few minutes; it might take someone more junior 15 or 30 minutes. I think that's a representative example of these intermediate tasks. >> Okay. And when you're measuring time horizon — different people take different amounts of time to do this — what counts as the time it takes humans to do it? >> There are different reasonable ways of doing this. One thing to say is that, in general, with the time horizon metric we're trying to get at something specific. One thing you could do, which I think would not give very interesting time estimates, is to randomly sample a person in the world — off the street, basically — to do each task. That wouldn't be a very useful measure of how long these tasks take people, because in general those people are not completing these tasks in the real world. The thing we're trying to get at with this metric is that we want it to be very intuitive: if an AI system can do tasks of some length — 15 minutes, 30 minutes, an hour, two hours — how does that translate into the real world? We want that connection to be very direct. So we want the people attempting these tasks to be people we would naturally expect to be doing these tasks in the world — people who have roughly a reasonable amount of expertise in the different areas. That's the expertise and sampling question. Then, we still have multiple people attempt many of these tasks, and sometimes they succeed and sometimes they fail, so there's the question of whether we include their failures or just use successful times. There's reasonable discussion about this. It would be nice to include failures, because if someone with a reasonable amount of expertise fails at a task, that is information about the task being more difficult — but I think you would need a larger number of people attempting the tasks to actually use that information. You could do something like survival analysis, from medicine, where you know they failed after a certain amount of time but it's possible they would have succeeded later. Anyway, the thing we actually do in the paper is use the geometric mean of the successful attempts. We use the geometric mean because completion time broadly seems to be distributed roughly log-normally — sometimes people take much longer than others — and we don't want that to totally dominate the time we're estimating for a task.
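A tiny sketch of that aggregation step as described: drop failed attempts and take the geometric mean of the successful completion times, so one slow outlier doesn't dominate. The times below are invented.

```python
# Sketch of the human-baseline aggregation described above: keep only
# successful attempts and take the geometric mean of their completion times.
import numpy as np

attempts = [
    {"minutes": 22, "succeeded": True},
    {"minutes": 35, "succeeded": True},
    {"minutes": 90, "succeeded": False},   # failures are dropped
    {"minutes": 140, "succeeded": True},   # slow outlier: damped, not dominant
]

successful = np.array([a["minutes"] for a in attempts if a["succeeded"]])
geo_mean = np.exp(np.log(successful).mean())    # ≈ 47.6 minutes
arith_mean = successful.mean()                  # ≈ 65.7 minutes (outlier-heavy)
print(f"geometric mean: {geo_mean:.1f} min, arithmetic mean: {arith_mean:.1f} min")
```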
>> One question I have about that: suppose you're looking at completion time for the kinds of people who are able to do a task. I worry that might compress the difficulty ranges. One intuition here: how much time does it take people to multiply, say, 67 and 34 by hand? There's a pretty large range of people who can do that task, and it probably takes them a couple of minutes. Then you can ask how much time it takes people to solve a separable differential equation — and if you're able to do that, it's actually not that hard (it depends on whether it's a thing you can integrate easily), but for people who can succeed at it, it probably takes about as much time as the two-digit multiplication takes the people who can succeed at that. But it seems like there's some sense in which solving the differential equation is harder, right? Maybe you want to say the thing that's harder about it is background knowledge — and things can just learn background knowledge, and we kind of know that. But I'm wondering what you think of that worry. >> I think this is a fantastic question that really gets at a lot of what's going on here and what's interesting about this work. I should say we're getting into more speculative territory. There are a few things to say. One is about the unique value of the approach we're taking with this time horizon metric. There are a lot of benchmarks that try to come up with the most difficult questions for people and then have AI try to do them. That's in fact the standard methodology for saying one AI system is smarter than another: it can do problems that fewer and fewer people can do. We started out with common-sense questions that most people can do, in the 2010s, and over the past couple of years models have become able to do much harder ones. I worked on this benchmark GPQA, which had very difficult science questions — roughly PhD level — and models are able to do that now; I think GPQA is mostly saturated, or pretty close to it. Models can do IMO questions — International Mathematical Olympiad questions — that very few people can do.
So I think that's an important axis along which to measure AI capabilities: difficulty for people. But it misses a lot of what people can do that AI systems can't. One of the key things we're trying to get at is: how can we reconcile the fact that models can do these IMO questions — that they're geniuses in some sense — with the fact that they're still kind of idiots? You ask one to book a flight for you, and maybe models can do that now, but even on slightly harder things they often fall over. That's the thing we're trying to get at, so we actually want to factor out how much expertise you need. One intuition for what we're trying to get at is something like the number of actions required to complete the task. That's very difficult to operationalize — it's pretty problematic and mushy — but one intuition at least is that if we factor out the difficulty of problems and just look at how long they take people who have a reasonable amount of expertise, then maybe we're getting closer to something like agency more broadly. I don't want to overclaim — this is still very much an open area — but number of actions, for example, is also a very reasonable thing that I would expect to be correlated, although probably more difficult to estimate. >> Fair enough. So, getting us out of that rabbit hole a little bit: an intermediate-level task that might take 3 to 15 minutes for a relevant expert is something like "take some CSV file and parse it". To help us get a sense of that, at what point do language models start being able to succeed at this sort of task? >> I might get the exact years slightly wrong, but somewhere in the range of 2022-ish is where models are able to do this, I think — maybe backcasting from the trend from where we are now. The specific trend we found was a seven-month doubling time over the past five-ish or six years, and currently models are able to do tasks — we estimate with 50% success likelihood — that are about two hours long. >> And "currently" is late September 2025. It may take me a while to edit this episode and get it out, but that's what I mean by "currently". So if we go back: two hours to one hour is early this year; another seven months back to 30 minutes is spring 2024; and then maybe 15 minutes is the middle of 2023 or something. >> I think that should be right, yeah — so a bit later than 2022.
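The backcast being done out loud here is simple arithmetic. Below is a sketch under the assumptions stated in the conversation (a roughly two-hour horizon in late September 2025 and a seven-month doubling time); the exact dates it produces land a little later than the rough figures quoted above.

```python
# Back-of-the-envelope backcast: assume a ~2-hour 50% horizon in late
# September 2025 and a 7-month doubling time, and ask when the horizon
# would have been 1 hour, 30 minutes, and 15 minutes.
from datetime import date, timedelta
import math

horizon_now_minutes = 120          # ~2 hours, late Sept 2025 (from the interview)
as_of = date(2025, 9, 25)
doubling_months = 7                # the longer-run trend discussed above

for target in (60, 30, 15):
    halvings = math.log2(horizon_now_minutes / target)
    months_back = halvings * doubling_months
    when = as_of - timedelta(days=months_back * 30.4)
    print(f"{target:>3} min horizon ≈ {when:%B %Y}")
# With these assumptions: 60 min ≈ February 2025, 30 min ≈ July 2024,
# 15 min ≈ December 2023.
```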
So what models were coming out around then? Wait, actually — what models are those? >> Oh, I don't know — you might know. >> Let's see, what's the exact timeline here? Something like — >> Is GPT-4 2023-ish? >> GPT-4 is the beginning of 2023, or the end of 2022 — one of those. So I think it's roughly GPT-4-ish, and that lines up with my intuition here. >> Okay. So we've got an example of an easy task and an example of an intermediate task. Can you give me an example of a hard task? >> Yeah. The hardest tasks we have take people something like six, seven, ten hours to complete. One of the sets of tasks we use actually comes from a benchmark we released about a year ago, or close to it, called RE-Bench, which stands for Research Engineering Benchmark — a set of challenging ML research engineering tasks. One example: you're given a neural network whose embeddings have been permuted in a way you don't know — they're scrambled — and your task is to fix the embeddings of this model. You can do fine-tuning or data analysis to try to understand how they were scrambled and see if you can reconstruct them. It requires some intuitions about how neural networks work and how to fine-tune or work with models at a relatively low level. There's a range of other tasks, but these take ML engineers roughly eight hours to do decently well on. So that's one class of tasks. We also have other kinds of software engineering tasks, or cybersecurity tasks, that take quite a bit of time. One example that comes to mind — I think we didn't actually get a human baseline on this one, so we're just estimating how long it takes: the task has a modified implementation of an older standard hashing algorithm, MD5, and the task is to find a hash collision on this modified version. There are standard attacks that work on this algorithm — there's a literature on attacks; it's not impervious — but you have to know which are the right ones, you have to understand the algorithm pretty well, and then you have to be able to modify the attacks, or figure out how to change them.
So this one is a little more expertise-heavy than serial-action-heavy; there's a bit of a range there. >> Okay. So one thing that strikes me about the tasks you mention is that they all seem very related to computer programming — programming, data analysis, machine learning, cybersecurity. I believe this draws from the benchmark you were the lead author on, Human-Calibrated Autonomy Software Tasks, or HCAST for short, and my understanding is that those are the five areas it covers. >> Yeah, broadly. >> Why the focus on software-engineering-type things? >> Great question. There are at least a few reasons. One is that some of the threat models METR is most concerned about are very contingent on AI capabilities in some of these particular domains — software engineering, cybersecurity, and AI R&D in particular. So we're most interested in measurements of AI capabilities in these domains because we think they're highly relevant for estimating risk, in particular catastrophic risk, from AI systems — and I'm happy to talk about those threat models. That's one reason: they're just directly relevant. Another reason is that there's been a lot more focus from AI developers on these domains, so we're measuring something that's closer to what they're focused on. I think this has some trade-offs. One objection is: AI systems are really bad at other stuff, because developers aren't focused on it, so now you're overestimating their capabilities. I think that's basically a legitimate concern — I think it is true. But there's this question of whether the methods AI developers are applying to improve models in these particular domains are general enough: if they're working well in these domains and they're general, then we might expect it to be relatively easy — more a product of ordinary commercialization — to apply those methods to a broader range of tasks. So I think we want to aim for some balance, and we want to understand how much generalization there can be from these domains; there are open questions around this. And then finally, it's just easier to measure AI capabilities in these domains. We're software engineers.
And in particular, one of the big things is that if you want a benchmark that's easy to run and easy to evaluate a model's performance on, it's much easier to do that in domains where you can more formally verify model outputs. If you want to understand how well models can summarize text, or write creative fiction, it's really hard to write code that automatically verifies that the creative fiction is actually good. There are ways of getting around this to some extent, but only to some extent. >> I guess one thing that occurs to me — I don't know if METR is best positioned to do this, but I kind of wish it happened more — is just ecological understanding, "ecological" in a loose sense of: do people use these things? When AI writes fiction online, how many downloads does it get? How often do people choose AI therapy over human therapy? My wish for the world is that we had better ways of tracking this sort of thing. It does rely on people being able to accurately assess, via their use patterns, how much AIs are actually helping them in these domains — and another piece of METR work, measuring open-source software developers and seeing whether they were good at estimating how much AI helped them, found that they were bad at estimating it, so maybe people are using AI all over the place and it's not actually helping them. But it does seem like one way of addressing some of these concerns. >> Yeah, totally. I'm super interested in this sort of thing. There was recently — the Anthropic societal impacts team, I think — they didn't quite get as far as measuring number of downloads, but they did some work recently, which I haven't looked at closely, breaking Claude usage down into really fine-grained categories. I think these would probably be pretty correlated: if there's strong market demand for a certain kind of AI output, you'd expect to see that show up in Claude usage data, to some extent at least. >> Right, fair enough. Cool. So we were talking about why software engineering, and the answer has three parts. Firstly, it's related to some threat models METR cares about. It's also easier to measure. Wait, I think there was a third thing in between those. >> Yeah, the third one is, I think, the sketchiest of these. Those are probably the two biggest ones.
The third one is something like: AI developers are aiming for this — and that has the trade-off I talked about in terms of generalization. >> Got it. So the next thing I want to talk about: one interesting thing about what you're doing is that you're basically saying, okay, we want to know how AI succeeds at tasks of various difficulties. If I had never seen this paper, I could imagine a whole bunch of measures of difficulty. I could use a human rating — on a scale of 1 to 10, how hard is this? — or how many years of education you need for it, or, when people try it, the probability that they succeed, or, if there's some competition between AI agents, the Elo (that only works in some domains, but Go is a really good one for that). And one that you do in fact look at in the paper is the intuitive messiness of a task — how clean and simple it is versus how tricky and rough. And the thing you end up finding is this really nice relationship with the time it takes humans to do it: within a model, success rate is lower on tasks that take humans longer, and across time there's this nice trend. I'm wondering: is this just the first thing that you tried, and it seemed like it worked well? Or do you have a really good sense that, no, we checked, and these other metrics just don't have relationships that are as nice and predictive? >> We've definitely done some of this. There's a vision where we've tried all of the things and this is the one — we definitely haven't done that. Maybe it'd be useful to talk about the specific alternatives. For at least a couple of them — maybe the first two you mentioned, how difficult people rate these tasks, or the probability of success when people attempt them — I think both are closer to the standard benchmarking paradigm. So I would expect those metrics to correlate more with, or be more connected to, this intuitive notion people have about how much expertise a task requires, which I think is already covered by other benchmarks. That's not to say we couldn't still use them as metrics — maybe we would see a robust trend — but I think it'd be difficult to operationalize them in a way that makes sense. For success probability: what is the exact distribution of people that you're having attempt these tasks?
That becomes very load-bearing. >> It seems like it's similarly load-bearing for success probability as for time horizon, right? >> I'm not so sure. One of the reasons we filter our baselines to only the ones that succeed is that success on a task is in fact a lot more information than failure. There are a bunch of reasons why someone might fail a task that aren't actually much information about how difficult it is, or how much agency it requires. For example, maybe we just got their expertise wrong: we're doing this job of manually assigning people to tasks we think they have a lot of relevant expertise for, and maybe someone just happened never to have used a particular tool, or set of tools, that's super important for this task — then their failure on that task is still some information, but if they succeed, that is just a very objective thing: yes, someone can complete this task in this amount of time. Other things: there are infrastructure reasons why people fail tasks, and there are also incentive reasons. When you have people try to complete tasks, sometimes they'll get bored and want to stop; sometimes they'll decide it's too hard and they don't want to keep going. Incentives can be tricky to set up well. One situation you can have is people quitting tasks early because they want to maximize their chances of getting more tasks that they succeed on — typically we pay bonuses for success because we want to incentivize people to succeed, but that can create a perverse incentive. So broadly, we just have a lot more uncertainty about failures than we do about successes. That's not to say we couldn't do something like this — I definitely can't say it's impossible — but I think it's more challenging. That's one particular thing; I'm probably not getting at the broader point. >> Maybe one way to get at the same question: you find this pretty good relationship between task completion time — among humans who are relevant experts and who in fact managed to complete the task — and (a) AI probability of succeeding at the task, and (b) trends over time in the time horizons models can do at a 50% or 80% success rate. >> Yeah. >> But it's not perfect.
And one thing you mention in the paper, which for some reason seems to have gotten less memetic — people seem to talk about it less — is that you have these metrics of the messiness of various tasks, and you end up saying there's something to this messiness thing: it somehow seems to predict task success over and beyond human time horizon. So one question to ask is: if I had to choose between just human time horizon and just these messiness ratings, which one would do better? And maybe the next question: if both of them are independently predictive, what does that say about the ultimate metric we really should be using? >> We are broadly really interested in trying to explain as much of the variance in models' successes and failures as we can, and you're totally right that the length of a task for humans is one metric that explains a decent amount of this variance, but there are definitely other things going on. We're actually currently trying to figure out what other properties of tasks explain their success and failure well, and we would love to have something like this. For something like messiness — and for a lot of these other kinds of metrics you can think of — the biggest challenge to me is subjectivity. People have very different senses of what is a messy versus a clean task, depending on their priors. One example: I have a colleague — I think it's fine for me to talk about this — who would basically not rate any of our tasks as being messy at all, because they have algorithmic scoring functions, for example, so success or failure is defined by this small surface area; and the tasks tell you what to do, whereas in the real world a lot of the challenge is figuring out what the hell you should do in the first place. So I think that's a challenge. But definitely — especially with this randomized controlled trial we ran recently on developer productivity, where we saw that developers, at least when we measured it, were not sped up by AI systems — trying to understand the gap between benchmark scores and some of these more ecologically valid experiments, what that gap is or what explains it, is something we're super interested in. >> So, actually, speaking of the relationship between how good things are at predicting success — >> Mhm.
>> One thing you also do is look at the correlation between models: if model A succeeds at a task, how does that predict whether model B succeeds at the task as well? This is one of these fun diagrams in the appendices, and it's possible you just don't know the answer, but one thing I noticed when looking at these correlation diagrams is that there's this block of GPT-4-and-beyond models that seem much more correlated with each other, on which tasks they succeed and fail at, than pre-GPT-4 models. What's going on there? Have they standardized on training sets? Is everyone after GPT-4 trying to train their models to do software engineering, and that's what's going on? Tell me about that if you can. >> I don't think I actually know the answer to this. I can speculate. I'm not certain this isn't an artifact of our particular setup. One thing you brought up is GPT-2: for recent models, we have them in this agent scaffold — a loop where they see some instructions and the state of their environment, they think about and consider what actions to take, then they take an action, use some tools, and continue. If you put GPT-2 in this loop, it just totally flops. So you can't really make a perfectly direct comparison — you do actually have to use a different methodology. I'm not certain that this block in the correlations isn't because of some difference in our agent scaffolding, for example. It's a really good question; I'd be curious to know. I actually don't know if we know — there's probably been some discussion about it, but I'd need to check. >> Oh, also, another thing that just occurred to me on alternative difficulty measures. I have a colleague from back when I was at CHAI, Cassidy Laidlaw, who has a paper — I forget the name; it'll be in the description, and I'll send it to you afterwards — where basically the thesis is: if you want to know whether deep reinforcement learning works on an environment or not — if you're familiar with reinforcement learning algorithms, one idealized algorithm is: you start with a random policy, and then you iterate. What would be the best action for me to take, given that from this point onwards I'm just going to act randomly? Then: what would be the best action to take, given that from this point onwards I'm going to do the thing that would be best given that from that point onwards I'd act randomly? And so on. And, if I recall the paper correctly, a very good predictor of how well deep reinforcement learning works on various environments is just how many steps of that you actually have to do — people can read it themselves. One nice thing about something like that is that it doesn't get at the aspects of messiness that are vagueness, because this is just reinforcement learning where you have a defined reward function, but it does get at some of the agency side — how much things depend on other things. >> Yeah, totally — kind of how fragile it is. Interesting. >> Yeah. Embarrassingly, I remember very little about this paper, but people should read it.
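A toy paraphrase of the idea being recalled here — the speaker himself says he remembers the paper only loosely, so treat this as illustrative rather than the paper's actual metric: start from a uniformly random policy, repeatedly replace it with the greedy policy against the previous policy's Q-values, and count how many improvement steps pass before the policy stops changing.

```python
# Toy paraphrase (not the paper's metric): count greedy policy-improvement
# steps, starting from a uniformly random policy, on a small random MDP.
import numpy as np

n_states, n_actions, gamma = 6, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(0, 1, size=(n_states, n_actions))                 # R[s, a]

def q_of(P_pi, R_pi):
    """Exact Q-values for a policy given its state-transition matrix and rewards."""
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    return R + gamma * P @ V                                       # shape [s, a]

# Step 1: greedy against the uniformly random policy.
Q = q_of(P.mean(axis=1), R.mean(axis=1))
policy = Q.argmax(axis=1)
steps = 1
while True:
    P_pi = P[np.arange(n_states), policy]
    R_pi = R[np.arange(n_states), policy]
    new_policy = q_of(P_pi, R_pi).argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy, steps = new_policy, steps + 1
print(f"greedy-improvement steps until the policy stops changing: {steps}")
```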
>> So the last thing I want to ask about, digging deep into the time horizon stuff: at least for now, one thing readers notice is that there's basically this line on a log plot of year versus time horizon, and models are basically lining up along this line — but once you get reasoning models, they start bending up a little; they're a little bit above the line. And "Measuring AI Ability to Complete Long Tasks" was released in February or March of this year, pretty early, when we had fewer data points. We've gotten a few more data points since, and early on there was some speculation about whether we're going super-exponential or not. With more hindsight, are we going super-exponential? >> Great question. I would love to know the answer. I think we still don't really know. A couple of things to say. One thing that's useful to point out for listeners is that for this plot, where we measure the trend of improvement over time, we're only using the best model at a given time. That's relevant because there are a lot of other models with different trade-offs — maybe they have faster inference but they're weaker — and we're just using the models that perform the best. The timeline of how the paper came together is useful here. We actually initially only fit the trend on models from, I think, basically 2024 onwards. The first version of the graph was made by Ben West in December 2024, if my memory is right, and it was just using that year's models — and with those models we actually observed a four-month doubling time in the time horizon. Then we asked: does this trend extend backwards? In the paper we also do these backcasts from it, and we added in previous models. All that's to say that, to some extent, from the start we've seen these two trends. I think this is all kind of, I don't know, BS, or something — when you have 10 or 15 data points and you're fitting piecewise linear functions, it's pretty sketchy — so I definitely don't want to overclaim. But it does seem like this four-month doubling-time trend from 2024 onwards has continued to hold, or has been a much better predictor, than the seven-month doubling time suggested by the models going back to 2019. So my best guess — very low confidence — is something like: we're just on this four-month trend now, but it's still just exponential. It really is hard to distinguish between different kinds of model fits, to some extent.
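A sketch of the kind of fit being discussed: regress log2 of the 50% horizon on release date and read off a doubling time, once over the whole period and once using only the more recent points. The (date, horizon) values below are invented placeholders chosen to mimic the shape of the discussion, not METR's data.

```python
# Sketch of the trend fit under discussion: doubling time from a linear fit of
# log2(50% horizon) against release date, full period vs. 2024 onwards.
import numpy as np

# (decimal year, 50% horizon in minutes) -- invented placeholders.
points = [
    (2019.5, 0.05), (2021.0, 0.2), (2022.5, 1.0), (2023.5, 3.0),
    (2024.2, 5.0), (2024.7, 15.0), (2025.2, 40.0), (2025.7, 120.0),
]

def doubling_time_months(pts):
    years = np.array([p[0] for p in pts])
    log2_h = np.log2([p[1] for p in pts])
    slope, _ = np.polyfit(years, log2_h, 1)   # doublings per year
    return 12.0 / slope

print(f"full period : {doubling_time_months(points):.1f}-month doubling")
print(f"2024 onwards: {doubling_time_months([p for p in points if p[0] >= 2024]):.1f}-month doubling")
# On these made-up points the full-period fit comes out near 7 months and the
# recent fit near 4 months, mirroring the two trends described above.
```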
>> Actually, the thing you said about different models made me wonder: if we're saying that time horizon is going up over time, then suppose I want to project that into the future. It's one thing if this is true at basically fixed cost. It's another thing if it's always the case that a one-minute task costs $1, a two-minute task costs $2, a four-minute task costs $4 — and then maybe we get models that can technically do things a human could do in a month, but it would be cheaper to just get the human to do it for a month. Off the top of your head, do you happen to know what the picture looks like with cost? >> That's a great question. It's something we try to keep an eye on. For recent models, our agent scaffold has a token limit that we tell models about, so they're aware of it — I think we've been using a token limit of something like 8 million tokens — which for these longer tasks ends up being at least one order of magnitude cheaper than paying a human with relevant expertise to complete the task. >> And to give a feel for that, 8 million tokens is something like six Bibles of text, roughly. >> Yeah, it's quite a lot. You can do much better than that with caching — most APIs let you do prefix caching, and that helps quite a bit, so you should count it somewhat differently — but it's a big chunk. Models will do lots of reasoning and run a bunch of different experiments on these longer tasks. They'll take something like 10 to 50 actions in the environment, and for each action they're doing a bunch of reasoning — and, depending on the exact agent scaffold, in many of them we have models that propose actions, then review them, then select the best one — so there's a lot going on, and it's still much cheaper than having people do it. I wish I knew the exact numbers on cost; it's more complicated because of caching. But currently this isn't our biggest concern, basically because models are still just highly cost-competitive. I totally imagine this changing at some point: with trends toward models being able to use test-time compute more effectively, I totally expect that for very long tasks this will get expensive, and it will become very important to measure the Pareto frontier of cost and success rate. I think we're excited to do work on this as it becomes more relevant. >> Yeah, fair enough.
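A rough sanity check of the "order of magnitude cheaper" claim above. The 8-million-token budget comes from the conversation; the per-token price and the expert hourly rate are assumptions added purely for illustration.

```python
# Rough cost comparison. Only the token budget comes from the interview; the
# price per million tokens and the human hourly rate are assumed values.
TOKEN_BUDGET = 8_000_000          # agent token limit mentioned in the interview
PRICE_PER_MTOK_USD = 10.0         # assumed blended input+output price per million tokens
HUMAN_HOURS = 8                   # a hard HCAST/RE-Bench-style task
HUMAN_RATE_USD_PER_HOUR = 100.0   # assumed rate for a relevant expert

agent_cost = TOKEN_BUDGET / 1_000_000 * PRICE_PER_MTOK_USD   # $80 at the assumed price
human_cost = HUMAN_HOURS * HUMAN_RATE_USD_PER_HOUR           # $800
print(f"agent ≈ ${agent_cost:.0f}, human ≈ ${human_cost:.0f}, "
      f"ratio ≈ {human_cost / agent_cost:.0f}x")
```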
>> So, zooming out: METR — Model Evaluation and Threat Research. I kind of think of METR as trying to figure out how scary models are, and if they're scary enough, then, I don't know, maybe we should do something. So this kind of work — measuring sort of general software engineering capabilities and trying to forecast them over time — what's the rationale behind it? Why focus on this? >> Broadly, the threat model METR is most concerned about, at least at the moment, is a rapid acceleration in AI capabilities — in fact, in the rate of progress of AI capabilities — due to AI systems becoming able to contribute substantially, or to contribute the majority, of AI progress at some point in the future. Currently, the way you make AI systems better is through a combination of compute, hardware, resources, money, data, and talent — labor. If it becomes the case that AI systems can replace the labor, the talent part of this, then in economic models of progress — at least in some of them, which I think are broadly reasonable, although I'm not an economist — you can see very rapid progress, and that just seems broadly scary. One example: you might see very rapid centralization of power in a single organization that does this kind of recursive self-improvement, and that's concerning for general stability, geopolitical, and democracy kinds of reasons. And then your arguments for why the AI system itself is not going to be dangerous might break down.
So you might not be able to evaluate it effectively, because, for example, the system may have a really good understanding of exactly how you're evaluating it, and if its goals are different from yours, then it might be very easy for it to game your evaluations. Your supervision methods might break down: you're reading its chains of thought, for example, and the model is saying things that seem very safe and nice and reasonable, but actually it's doing some kind of hidden reasoning in the background that you can't detect. And you didn't realize this was about to happen because progress was so fast, and because as a lab you were scrambling to get as much compute and make as much progress as you can, as quickly as you can. So broadly, one of the big concerns or questions we want to understand is: how close are we to this kind of rapid acceleration? Is it even possible? As I said, labor is not the only input to AI progress; you also have compute, for example, and data, and those things might be highly complementary to labor, such that even if the amount of talent increases by several orders of magnitude because you have AI researchers doing the work, you might still end up very bottlenecked by compute and data. So we try to get some understanding of that. We do some work on, or think about, these kinds of economic models to some extent, but I don't think it's our chief forte. Epoch AI has a bunch of great work doing some of this modeling. Folks at Forethought, I think the org is called, Will MacAskill and Tom Davidson, have done work on this kind of economic modeling. Anyway, understanding how capable AI systems are is a big input to this, and software engineering and ML research capabilities are highly relevant.
>> And how much is the aim here... so one thing you could do with this is say: are we there, or are we about to be there, and that's the point of doing the measurements. Another thing you could do is try to say: are we going to get there in 2030, or are we going to get there in 2050, based on what we know now? So how much is the thing you're trying to do a forecast versus a nowcast?
>> Yeah, that's a great question. I think we would love to be able to do really good forecasts. Unfortunately, I think it's really, really hard. So for example, as we talked about a little bit, new paradigms in AI might change the trends that we observe. Also, there are lots of inputs to these trends that might not be durable.
So for example, we're seeing the time horizon of AI systems increasing exponentially, but the amount of money, and the amount of compute, being put into training AI systems has maybe also been increasing exponentially. I actually don't know the exact details of how compute spend has been increasing, but...
>> I think it's exponential. I feel like if I go to Epoch AI, they're going to show me some nice graph.
>> Yeah. And so maybe that's just the cause, and in fact we're going to hit some bigger bottlenecks in the economy more broadly: it's just not going to be possible to keep funding increasingly large data centers. So there are a bunch of... I don't know. Actually, this is kind of an interesting point: I basically view this time horizon trend we're seeing as something closer to an economic model than to an ML benchmark model, where the actual inputs to the progress are firms that are competing to train increasingly better models, putting these resources in, under these constraints, and so on. And for me at least, one of the big updates is that I guess I'm much more interested in economics as a result of seeing this really robust trend, because I was actually extremely skeptical of putting time on the x-axis in particular. I thought the inputs were just going to be these random decisions by different labs, and that there was no way we were going to see some robust trend, because it just depends on who Jensen happens to like, or whatever.
>> Jensen Huang being the CEO of Nvidia, right?
>> Yeah, for different compute deals or something. And I thought, no way that could be robust. So that was a decent update for me: maybe these kinds of extremely abstract economic models actually can be very informative, or maybe there is this kind of deeper systematicity to AI progress, even though zoomed in it feels very contingent and kind of arbitrary. I don't know; this is all very much speculation, just my musings on it. I think as an org, we're definitely interested in forecasting. There are trade-offs between doing this more abstract modeling and just focusing on, well, we do a lot of work on this nowcasting kind of thing, because even "how good are AI systems currently?" is kind of an open question, or at least there's a lot of disagreement about it. Even internally at METR we have disagreement about this.
Probably there isn't one single answer, right, because it's just a complicated question, but I think we're trying to do both to some extent.
>> Fair enough. So for either forecasting or nowcasting: suppose I want to use the time horizons work, or its nearest successor, to tell me when we're going to get AIs feeding into AI progress. How am I going to use the results? Like, it's three months: does that mean we're at recursive takeoff?
>> Yeah, I think this is kind of an open question, or I don't think we have nearly as good an answer here yet as we want. I mean, we have some heuristics: something like one week of work, time horizons of around 40 hours. At that point we'd definitely be getting a lot more concerned, or it seems at least plausible that you could successfully, or efficiently, delegate weeks' worth of work to AI systems, and I could totally imagine that speeding up AI progress quite a bit. Same for time horizons that are much longer. But I think we don't really know, is my answer. Part of my uncertainty: I probably would have been more confident in something like a week or a few weeks of work as a time horizon that is very useful as some kind of rough heuristic or threshold before this productivity RCT, where we found that people, open-source software developers in particular, were very miscalibrated about how much AI systems sped them up, and in fact we saw that they were slowed down on average, by about 20%. I think the time horizons work and this randomized controlled trial result are probably not as much in conflict as they might seem at face value, for reasons we could talk about. But it definitely did update me towards broader uncertainty about this interaction between AI systems and people: maybe we do end up really bottlenecked by things like our ability to specify tasks really clearly, or maybe, because we're algorithmically scoring models, we're overestimating their capabilities to some extent.
>> Yeah. Actually, in terms of other bottlenecks, I'm really interested in talking about that. If we're interested in, say, knowing at what point we get this runaway process or whatever, it really matters what AI is automating. Suppose there are five things you need to be good at to do recursive self-improvement, right? The difference between AI being able to do four of those and AI being able to do five of those is huge, right?
>> Yeah.
I think one concern I might have about the METR benchmark stuff, or about this particular paper, is just: is it covering all the bases, or is it covering some of the bases? Because potentially that could really reduce its value for this particular purpose. I'm wondering, do you have thoughts about that?
>> I think that's a pretty legit concern. I guess there's this question of: what are the specific things that are bottlenecking, and how different are they from the things that we're measuring? So one broad reply could be something like: to the extent that our benchmark is just a bunch of different, diverse tasks, hopefully we're covering some decent amount of the space of necessary skills or capabilities, such that we would expect results to be very correlated with things we're not measuring specifically. And we can maybe get some sense of this by looking at the variance of model performance on our tasks.
>> I guess one thing you could presumably do is just have a held-out 20% set and see whether performance on the non-held-out set predicts performance on the held-out set. I guess that's probably in some appendix somewhere.
>> I think the thing you would want to do there is have the held-out set be importantly different in some kind of biased or systematic way. I think that would be interesting. Currently we haven't done this. To some extent, maybe, the messiness analysis is trying to get at something like this: are there other factors that explain model capabilities?
>> Yeah. I guess there was also a blog post METR put out basically trying to do a similar analysis for other domains. So there's a little curve for self-driving, and there are curves for, I forget exactly what all the other tasks were, but my recollection is that it seemed like in each domain you maybe had some sort of exponential increase in time horizons, but the best-fit doubling times were different in different domains.
>> Yeah, I think so. My broad takeaway from that work, which Thomas Kwa led, was that in decently similar domains, so question-answering benchmarks for example, GPQA was one of the benchmarks and there were a few others, I think we saw quite similar doubling times overall, is my memory, and actually even pretty similar absolute time horizons, which was some amount of validation. The challenge with this kind of work is that we put a lot of time into estimating the lengths of our own tasks, right?
Whereas for most of these other domains we're using scrappier, more heuristic, less precise estimates of task length. And then I think self-driving did have a slower doubling time, but I don't think it was clearly not exponential. The other interesting takeaway I had was with respect to more general computer use. There's this benchmark, OSWorld, that has a bunch of tasks where you have a browser and you need to do these tasks, or you're in an operating system and you have to click around and manipulate normal software. The key difference between that and a lot of our tasks is that our tasks are almost entirely text-only, and models seem relatively weaker at multimodal tasks. For those domains I think the doubling time was something similar, but the absolute time horizons were much, much lower, I think a couple of minutes or something, which I thought was interesting. I'm actually kind of confused about it broadly; I don't really understand what's going on there.
>> So, OK, with all that said about the pros and cons of this sort of framework for tracking whether we're getting close to some sort of self-improvement cycle: what's your guess about whether, say one or two years from now, we're still thinking that something basically like time horizon is the metric we're tracking, or whether we end up saying, oh, actually there's something pretty different, and that's the real thing?
>> Yeah, that's a great question. I think to me a lot of this comes down to the tractability of continuing to use this metric and estimate it, and that's somewhat unclear. For example, we've paid a lot of people for their time to work on these tasks so that we can estimate how long they take. If the length of these tasks becomes weeks or months, that gets pretty expensive.
>> Actually, how expensive was it to make this paper?
>> Ah, that's a great question. It's a little tricky, because there were these different efforts going on. We included the RE-Bench tasks and the baselines for those tasks, and that was kind of a separate project, so it maybe depends on whether you count that. But for the baselines on the main set of tasks that we used, the HCAST tasks, I want to say these were somewhere in the range, in total, of at least tens of thousands, possibly low hundreds of thousands, of dollars. Something in that range.
>> Sure.
I probably should know that figure off the top of my head more accurately, but yeah.
>> Yeah, but it sounds like it's reaching a stage where measuring these time horizons is getting close to the dominant cost of actually doing this work. It's probably still lower than the salary cost, since you've got a bunch of people working on it, but if it were to become more of a thing...
>> At some point, yeah, I think this does start to dominate. Although I would say that currently, actually creating the tasks is the most expensive and difficult part. So that's either creating them from scratch, or trying to find good tasks in the wild, as it were, which is nice because they already exist to some extent, although you have to port them over into your framework, but also because it gives you more confidence that they're realistic and representative of real work that people are doing. That's important when we don't fully understand exactly when and why AI systems succeed or fail.
>> Actually, maybe this is worth talking about a bit. There's one kind of approach to measuring AI systems which says: look, we need to isolate things, we need to get down to the simplest feasible task where we can really measure exactly what's going into it. These end up being things like, maybe, ARC-AGI; it's not quite this, but it's something sort of like this. Versus a sense of: no, we need to create things that have this realness flavor, even if they're not exact. Like, finding an MD5 hash collision: on some micro level it's not very similar to doing AI research, right? Could you say a bit about how important it is to be thinking about economic usefulness versus trying to mimic a sense of what the tasks you care about are?
>> Yeah. I think there is a very real trade-off here between, on one end, the level of granularity of your understanding: if you maximize that, you end up with these very simple, formulaic, systematic benchmarks that are probing some very particular skill in a systematic way. And then on the other end you have this realism-maximization lens. I think the best popular example of that is maybe SWE-bench, or SWE-bench Verified, where these are actual GitHub issues and PRs and tests that you're measuring AI systems against. So there's a real trade-off where on one end you get this granular understanding, and on the other it's really easy to interpret what a certain success or failure means.
It's like: OK, yes, it can do this thing in the real world that I understand and have some intuitions about. So I think there's a real trade-off. What do I think here? I think it's really hard. Broadly, I feel pretty pessimistic about the granular approach. Maybe this has something to do with the amount of systematicity in neural networks themselves, where they are just kind of inconsistent but are still often capable of really impressive things. So maybe you just can't get this extremely crisp understanding, and you have to aggregate, or look more broadly at things that actually are relevant for your decisions about whether to deploy a system, or how safe it is, or whatever. I think that's probably the direction I lean in.
>> Yeah. I also wonder if there's something along the lines of: often these sort of high-level things, take something like economic growth, are an aggregate of a bunch of things a bunch of people are doing. It's not very well isolated, and also it's relatively smooth and predictable. Not totally, but it's pretty smooth, right? So, like time horizon: you might not have thought it would be this nice trend, but it is. Part of the reason this is on my mind is that I'm reading this book, Inventing Temperature, which is very popular in these LessWrong spheres, and I'm finally getting around to it.
>> I haven't read it yet, but I've heard lots of great things about it.
>> That's great. I'm going to spoil it a little bit. The first chapter is basically about the problem of: suppose you want to have a thermometer. Suppose you want to standardize a temperature scale that all these thermometers use. In order to do that, you've got to find some phenomenon that's always the same temperature, but that's repeatable, that a bunch of different people can use. So firstly, there's a bit of a weird circular thing, where you have to know that a phenomenon always has the same temperature before you have a thermometer, right? Which, OK, maybe you can use the same thermometer multiple times and just trust that the volume of the mercury or whatever is a good proxy for the thing you want to call temperature. But one funny thing is that initially people were just really wrong about what could possibly work for this. You have people saying: what if we just use the hottest it gets in summer, or how cold it is underground?
>> Oh, that's so good. I love it.
>> Yeah, it doesn't quite work, right? But eventually people settle on boiling water. Now, firstly, we now know that the temperature water boils at depends on the atmospheric pressure; luckily they knew that as well, so they were able to control for it.
>> Oh, OK, cool.
>> But how did they know that? Does the book talk about that?
>> I don't know; so far I've only read most of one chapter or something. But I think you can do a thing where, especially if you're looking at temperature as a proxy for the volume of a liquid, and a lot of your thermodynamic knowledge comes from stuff like brewing or engines, you end up in situations where you have things at different pressures and different volumes, and I think that's the kind of thing you can figure out, especially if you have this identification of temperature with the volume of a thing under fixed pressure and fixed conditions or whatever. So, OK: boiling water. Have you ever, like, do you cook pasta? One thing you'll notice is that first little bubbles start appearing, then you get a bit of a boil, then you get a rolling boil, right? The temperature of the water is different at different points of this, and also the temperature of different bits of the water is different at different points of this. So what are we talking about when we talk about boiling temperature? If you look at the cover of the book, it's a picture of an early thermometer that has one line for mild boiling and one line for when it's really solidly boiling, boiling "vehemently" I think it says, and these are different temperatures, right? So there's this one scientist who takes the approach of asking: what are we really talking about with boiling water? He has this theory that one thing that happens with fake boiling is that water has little bits of air in it, and you start getting evaporation into those tiny air bubbles, and then the air bubble gets hot and rises up and you start seeing vapor, but that's not true boiling of the water; it's only happening because of these impure air bubbles. So he goes down this line of work of: let me get rid of all the random little things. We're going to have as smooth a surface as possible, we're going to get rid of all of the air bubbles, and it turns out you can get water way above 100 °C before it actually boils, right? And the answer they eventually end up with is that water vapor is at a very consistent temperature, even when the temperature of the water is not. But the reason that's true is precisely because there's a bunch of dust in the air, little things that the vapor can nucleate around, and that stops the vapor from getting too hot or too cold before condensing. And in fact, have you heard of cloud chambers?
>> No.
>> They're used in particle physics, and basically what they have is this supercooled vapor, vapor that's under 100 °C, that's ready to condense but doesn't have anything to nucleate around. If you shoot a particle into it, it condenses around that, and so you can see the trail.
>> Oh, cool.
So in thermodynamics there's this general thing where a bunch of random messy stuff produces a bunch of observable regularities at a somewhat higher level, right? We have this in thermodynamics, it seems like we kind of have it in economic growth, and part of me wonders if that's what's going on in how we should understand neural network capabilities. Or maybe I just read a book and I like it.
>> No, I love this. I think this general idea is super interesting. I mean, there's another model you could have for AI systems, for how they perform on tasks: you could imagine that there's something like a constant failure rate that AI systems have as they're attempting tasks, and different tasks might have different failure rates, which complicates things.
>> And by failure rate, you mean per unit of time that a human takes to do it?
>> Something like that, exactly. Toby Ord actually did some analysis, some follow-up work, on the time horizon paper, where if you assume this kind of constant hazard rate, so per unit of time that people spend, there's some percentage chance that the AI system is going to make some kind of catastrophic error and ultimately not succeed at the task, then this is also a good predictor of AI system success and failure on our tasks, as a function of the length of the task for humans. In our paper we used a logistic fit, but assuming a constant hazard rate, you would use an exponential fit.
>> I do think that Lawrence Chan had a response to that, which said that the logistic fit was in fact better, even though it used more parameters or something. I remember a response along those lines.
>> Totally. So yeah, we did explore different fits, and logistic was a better fit. I don't think it's obvious, though. For example, because of this aggregation over maybe different distributions of tasks, I don't think it's obvious how much we should weight the exact quality of the fit versus priors on simplicity, or on "this is a nice model"; I don't know how much to weight that. But stuff like this is, to me, very interesting in terms of understanding capabilities. I've often felt drawn to getting at something more like the intrinsic number of actions needed to complete a task.
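Editor note: a minimal sketch of the two success-probability models being contrasted here, a constant hazard rate per human-hour versus a logistic curve in log task length. The parameter values are illustrative, not METR's fitted values.

```python
import numpy as np

def constant_hazard(t_hours, hazard_per_hour=0.35):
    """Exponential survival curve: a fixed chance of a fatal error per human-hour."""
    return np.exp(-hazard_per_hour * t_hours)

def logistic_in_log_time(t_hours, t50_hours=2.0, slope=1.0):
    """Logistic curve in log task length, with 50% success at t50_hours."""
    return 1.0 / (1.0 + np.exp(slope * (np.log(t_hours) - np.log(t50_hours))))

# Both toy curves cross 50% at roughly a two-hour task under these parameters,
# but they diverge for very short and very long tasks.
for t in [0.25, 1, 2, 4, 8, 16]:
    print(f"{t:>5.2f} h  hazard: {constant_hazard(t):.2f}  logistic: {logistic_in_log_time(t):.2f}")
```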
That idea of an intrinsic number of actions feels intuitive to me, and other folks I've talked to feel like it's a really nice thing that could be useful for understanding this. You can imagine it slotting in well with the constant hazard rate model, where the hazard is per action you need to take, or something like that. But actually operationalizing it has been tricky: we've done some analysis of this, and it's been difficult to extract really good insights.
>> Yeah. So, OK, I think we're currently on a tangent from a question I was asking a bit ago, or I took us on a tangent, which is: two years from now, do you think we're still using something like time horizon? One big response you had is, well, will it just be infeasible to actually measure these time horizons? Setting that consideration aside, I'm wondering whether you have a sense of whether this is probably just the thing that's going to continue to be most robust, or whether we'll probably come up with a number-of-actions model, or something that incorporates the messiness results, or something like that.
>> I think my best guess is that we're going to keep using it, assuming we're able to keep estimating it in a way that we feel confident in. My best guess is we'll use it with different weightings or multiples or something, based on some of these other factors. But I guess I've become more pessimistic about figuring out things like number of actions. That's not to say I wouldn't be super excited about it, and I think there's a decent chance I'll take another stab at it at some point.
>> I wonder, OK, here's one: suppose we think that economic relevance, trying to mimic real-world utility, is just the thing. One thing you could imagine doing is figuring out what the market rate is to get someone to solve this task, right? Which is a mixture of expertise and time taken. Do you have a sense of whether that would end up being a better predictor?
>> Yeah, it's a great question. I think we have looked at this, or tried to estimate it, by clustering our tasks. I shouldn't speak too much to the details because I can't remember exactly what we did, but something like: these tasks are really hard ML tasks, so they're going to be more expensive, and these other ones are cheaper, and there's some trade-off. I think something like that could be reasonable. I guess the reason, or a reason, why you might not expect that to work is that AI systems broadly have a different capability profile than people.
So if it was, say, 1940 or 1950, maybe right before we had calculators, and you were doing this math of how much you'd have to pay human computers to calculate the products of 10-digit numbers that you need for whatever reason, you'd say that's an extremely hard task, and machines are not going to be able to do it for such a long time. But in fact, pretty quickly after that, computers were able to do this very well. Applying this to modern systems, and I do basically believe this, AI systems are way better at tasks that seem to require of humans many years of intellectual development and labor. They can do GPQA questions, they can do IMO problems, these sorts of things. So I view that as less of the bottleneck, and I view something more akin to agency, or being robust to the messiness factors, as more of it. That's not to say there aren't other useful metrics; maybe this is just an argument against expertise, or human expertise, as the thing to price in.
>> Yeah. So with that said, we've got the time horizon work, we have HCAST. I'm wondering: what, to you, are the open questions, and what kinds of things might I see out of METR in the next year or so, pushing this research direction forward?
>> Yeah, great question. Broadly, there are a few things. One is continuing to use this methodology. Currently models have 50% success rates on roughly two-hour tasks; GPT-5's time horizon is something like 2 hours and 15 minutes, I think. And if we really are on this four-month doubling time trend, then we're at something like four hours by the end of the year, 8 hours in the spring of next year, 16 hours by the fall of next year. That's not that long. And we have fewer longer tasks, and fewer baselines on those longer tasks, because they're more difficult to baseline: you have to find people with more specialized expertise, they're more expensive, and people fail more often.
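Editor note: the doubling arithmetic quoted just above, written out. The starting horizon and doubling time are the rough figures from the conversation; the start date is an assumption, and none of this is an official METR projection.

```python
from datetime import date, timedelta

start = date(2025, 11, 1)   # assumed "today" for the projection
horizon_hours = 2.25        # roughly 2 hours 15 minutes, the figure quoted for GPT-5
doubling_months = 4         # the faster of the two doubling-time fits discussed

for k in range(1, 5):
    when = start + timedelta(days=round(30.4 * doubling_months * k))
    print(f"{when}: ~{horizon_hours * 2**k:.0f}-hour time horizon at 50% success")
```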
So extending our task suite and trying to see whether this trend continues is one big direction, and there are open questions around how we actually continue doing this affordably: are we pulling, harvesting, tasks from existing work that people have already done? Are we creating new tasks and then using LLM evaluation, or more manual review, to evaluate success on them? Are we doing other things? So that's one class of things, continuing this basic methodology. Then there's another class of directions we're pretty excited about. What I just described is something like benchmark development and then evaluating models on those tasks, but there are also a bunch of questions around how good our benchmarks are, and how good other benchmarks are. For example, over the last couple of weeks I've been labeling many dozens of model attempts on SWE-bench with a bunch of different factors, to try to understand things like: how good are the tests in SWE-bench? Are models often implementing correct functionality that isn't captured by the tests, because the tests were written for the specific implementation the human originally wrote? Or, alternatively, are models often succeeding as judged by the automatic test cases, but actually breaking a bunch of other code in the repo that isn't tested, or producing a solution that's so bad in some other way that we wouldn't actually call it a success? Broadly, this is one example of a stream of work we've started doing more of over the past few months, trying to understand benchmarks, this science-of-evals stuff: how can we interpret certain scores on different benchmarks, ones that we've made and ones that other folks have made? There are also questions around the extent to which current methods for improving AI systems will generalize. One open question to us is around training models on verifiable, formally verifiable, tasks, like passing test cases; people talk about reinforcement learning from verifiable rewards. There's a question of how much progress is currently coming from this, and maybe two corollary questions: how much should we expect progress from training in this way to generalize to non-verifiable tasks, or tasks that are messier or more qualitative?
And then, alternatively, if improvements from this type of training don't actually generalize well, how much human data, for example, do you need to train models that are good on more qualitative, messier tasks? So we're trying to get some sense of things like this. It's something we're interested in, and the exact projects we end up doing will depend on specifics.
>> Fair enough. So that's things that METR might end up doing. There's a whole other world out there, including listeners to this podcast. If they're interested in advancing this research direction, what would be good things for outside people to do?
>> So one thing that I've been really excited about is work making it easier to run evaluations in standardized ways. At METR, we've started using this platform for running evaluations called Inspect. It's open source, primarily developed by folks at the UK AI Security Institute, and it's great. There are a bunch of benchmarks that have been implemented in it, and I'm super excited for more benchmarks to make it in, and to improve the ecosystem's ability to broadly run these evaluations. That's more on the engineering side of things. In terms of research, I'm excited about people extending the time horizon methodology to more benchmarks. So actually, someone, this guy Sean Peters, I think his last name is, evaluated models on cybersecurity benchmarks in particular. He used time estimates from those benchmarks, or I think he did some amount of estimating task length himself, and fit some trends to model performance on that particular slice. I thought that was a really useful way of getting more data and validating these things, so I'm excited about direct follow-up work like that. There are also directions in the vein of what we talked about: trying to decompose model success and failure, or understand what the fundamental trends going on here are. I said earlier that I was pessimistic about these extremely constrained, less realistic types of tasks, but I do still think they can be quite useful, almost as diagnostics, just helping bound our understanding of what models can and can't do.
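Editor note: for readers who want to try the Inspect platform mentioned a little earlier, this is roughly what a minimal task definition looks like, following the library's documented quick-start pattern. The sample content and model name are placeholders, and the exact API should be checked against Inspect's current documentation.

```python
# A toy inspect_ai task: one sample, a plain generate() solver, and a
# match() scorer. The sample and model are placeholders, not a real eval.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def toy_eval():
    return Task(
        dataset=[Sample(input="What is 17 * 23? Answer with the number only.", target="391")],
        solver=[generate()],   # send the prompt to the model and collect its completion
        scorer=match(),        # score by comparing the model's answer against the target
    )

# Typically run from the command line, for example:
#   inspect eval toy_eval.py --model openai/gpt-4o
```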
On that diagnostics point, something that comes to mind is that people have made tasks that are basically just: how many repetitions of a very basic action can models take in a row before they fall over or get off track? Things of that nature; very large arithmetic comes to mind as an example. I think things like that are actually interesting, although to me they're more about bounding model capabilities or something.
>> Fair enough. So the second-to-last question I'd like to ask is: is there anything that I should have asked that I haven't yet?
>> Great question. I think broadly we've covered a fair bit of METR's capability evaluation work. There are big open questions to me around how long we'll be able to continue doing this work, not even just from a tractability perspective, but also from a "will it actually be useful" perspective, in particular for estimating risk. At a certain point, if we are seeing that AI systems are able to do AI research very effectively, then how do we continue estimating risk? Is risk just maximal? Probably not: people are still going to be doing kinds of monitoring, and I expect folks to implement basic kinds of control methods. So over the past few months we've been doing more work trying to create better metrics for things like monitorability. I guess I'm just describing this as a question, and I haven't been working on it myself, but I think it's very interesting and exciting work.
>> Yeah, sounds cool. So, speaking of which: if people are interested in following the work that you and your colleagues at METR do, how should they go about doing that?
>> Yeah, go to our website, metr.org. We publish our research updates there, and there's a way to put in your email and subscribe. We also post on Twitter; METR Evals, I think, is our handle, though I can't remember it exactly.
>> It'll be in the description.
>> Yeah. And we're also hiring: we're hiring experienced researchers and research engineers. So if that's you, definitely reach out, and we may be excited to chat.
>> Great. Well, thanks very much for coming and chatting with me.
>> Yeah, thanks a lot for having me. This was really fun, Daniel.
>> This episode was edited by Kate Brunaut, and Amber Dawn Ace helped with transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR Labs.
Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript, you can visit axrp.net. You can also become a patron at patreon.com/axrpodcast, or give a one-off donation at ko-fi.com/axrpodcast. Finally, you can leave your thoughts about this episode at axrp.fyi. [music]

Related conversations

AXRP · 7 Aug 2025 · Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): median 0 · mean -5 · 133 segments

AXRP · 6 Jul 2025 · Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): median 0 · mean -4 · 72 segments

AXRP · 15 Jun 2025 · David Lindner on Myopic Optimization with Non-myopic Approval

This conversation examines core safety through David Lindner on Myopic Optimization with Non-myopic Approval, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): median 0 · mean -2 · 113 segments

AXRP · 1 Dec 2024 · Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread · Spectrum trail (transcript): median -6 · mean -7 · 120 segments

Counterbalance on this topic

Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.

Mirror pick 1 · AXRP · 7 Aug 2025 · Tom Davidson on AI-enabled Coups

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0. Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): median 0 · mean -5 · 133 segments

Mirror pick 2 · AXRP · 6 Jul 2025 · Samuel Albanie on DeepMind's AGI Safety Approach

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0. Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): median 0 · mean -4 · 72 segments

Mirror pick 3 · AXRP · 15 Jun 2025 · David Lindner on Myopic Optimization with Non-myopic Approval

Spectrum vs this page: this page -10.64 · this pick -10.64 · Δ 0. Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): median 0 · mean -2 · 113 segments