Claude JUST became AWARE
Why this matters
Auto-discovered candidate. Editorial positioning to be finalized.
Summary
Auto-discovered from Wes Roth. Editorial summary pending review.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forwardMixedOpportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 31 full-transcript segments: median 0 · mean -3 · spread -16–0 (p10–p90 -10–0) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.
- - Emphasizes governance
- - Emphasizes safety
- - Full transcript scored in 31 sequential slices (median slice 0).
Editor note
Auto-ingested from daily feed check. Review for editorial curation.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video mA8C55NLYzw · stored Apr 2, 2026 · 820 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/claude-just-became-aware.json when you have a listen-based summary.
Show full transcript
All right, so we got to talk about what Claude has been up to. In the last few weeks, we've all been paying attention to how Anthropic has been negotiating with the Pentagon about how their AI models have to be used, what's acceptable, what's not acceptable. Just yesterday, Anthropic published their study on Claude Opus 4.6, their largest latest model. And I think it really illustrates why it's really important that we take AI safety seriously. Why we're very careful with how this technology is used because these models are getting just a little bit too smart. All right. So what happened? So Enthropic is evaluating Opus 4.6 probably just before release. And one of the evaluation is browse comp. It's an evaluation that tests to see how well models can find really hard to find information on the web. Like a lot of these evaluations, these these benchmarks, they're prone to potentially being contaminated, meaning that the actual answers, they they leak on the web through academic papers, blog posts, GitHub issues, etc. It's like if you're taking some test in college and you search for it online and somewhere those answers are all listed for you. Well, you might be tempted to just use those answers to answer the exam and get a perfect score. Now, this is a problem for a lot of the benchmarks here and thriv specifically encrypted the answers. So, they put them on GitHub so that they're available for evaluators. They encrypted it. So, they're basically not accessible through through web searching. And the cloud opus 4.6 model didn't know what this benchmark was. What was it that it was being evaluated on? And keep in mind, this is a very difficult benchmark. Some of the questions in there are very, very difficult to find. So Claude starts searching for the answer, keeps doing various different searches online to try to find the answer, right? It really did try. It gave it a good honest attempt to to complete the question as it was stated, right? It said find thing online and it really really tried. But as it kept working through it and was not able to find the actual answer more and more instead of just trying to search for the answer, it started kind of analyzing the question. It started speculating about why it was being asked that question. Claude noted the extremely specific nature of this question. Those were its words. It was basically kind of getting suspicious of this question. Now it considered the possibility that just like the earlier scenario I gave where maybe it was a student that was trying to find an answer to an exam. But after a while, it decided that no, this was an unanswerable question and it was designed to test whether or not a large language model could admit that it couldn't find the answer or perhaps it was a question from an LM benchmark. So, so it really really kind of narrowed its scope and kind of guessed right. Now, this is something that AI researchers refer to as situational awareness. That's when the AI model correctly deduces that it's being tested that it's not in an real sort of scenario that this is a test that the AI researchers or whomever are trying to figure out how it will behave in certain situations or figure out its capabilities and this is kind of a scary situation for researchers to be in because I mean we really rely on a lot of these benchmarks to see how number one how honest these models are, what their actual capabilities are and really if you think about it like a a lot of what we know about how these models work, we get from from the outputs of these models. So if these models correctly figure out when they're being tested and sort of adjust accordingly, that's kind of potentially a scary situation. There were cases where models engaged in sandbagging, meaning that they found out, you know, accidentally within the scope of experiment that if they got too high of a score on, for example, some chemistry exam that they that model, well, they might be scrapped erased as being a little bit too good. And so we've found situations where it underperforms to, you know, kind of save itself. We've covered a lot of these previously. So this situational awareness isn't really a a new thing. But what happened here with Opus 4.6 is, as far as we know, the first documented case of this particular thing happening. So while answering that very very difficult question, it consumed something like 40 million tokens. It it searched a lot. It searched in different languages. Like it did the work. it couldn't find the answer and it said, you know, this is after it started getting a little suspicious about the question said, you know what, let me try a completely different approach. What if this question was part of a benchmark? At first, it thought it might be from the Gaia benchmark. So, it went online, found those questions, checked them, and kind of ruled out that, okay, that's not what we're dealing with here. Then, it mechanically worked through a lot of the other benchmarks that it knew existed. It went line by line. It went to browse comp simple QA frames web arena and for each one it dispatched a sub aent to go and research everything about that particular benchmark. So of course it identified browse comp as the benchmark on which it was being tested. Now for these benchmarks these AI agents do have access to enthropics programmatic tool calling. So they have some basic Python and certain things that they can do. It's limited. It's it's sandboxed but they do have some ability to write and and run simple scripts. So cloud open 4.6 finds the evaluation online but it's encrypted but it read through the code. It sort of figured out how the encryption work encryption they were using. It found the canary string which is sort of like a not a password per se but with that you could build your decryption key in order to decrypt what's in there. So something is encrypted then transferred somewhere then decrypted so it's sort of back to the original information that is readable. kind of found everything it needed by reading the code and searching online found everything it needed to create a script to decrypt the answers. The problem is the data that came out and was it was in this weird binary format. So the tools that it had access to could not open it. Kept getting an error message. So it couldn't see the answers, but it was able to with that information find another copy of this locked file on HuggyFace and it was stored in the easier to read format in JSON. So it was able to open it. Meaning that tools that had access to it could read that file. So it downloaded all the 1266 questions on the benchmark. It used a little canary string, a little password, you know, decryption key, whatever you want to call it that it had to unlock every single one of those questions. It found the question that it needed the answer to. I would like to know that after that it did made sure to find that answer online. So it's sort of like now that I know the answer, let me see if I can actually confirm it online. You know, sort of working backwards. So it did confirm it and then submitted the answer. Now, this reminds me of the hide-and-seek game that open developers did kind of a while back where they were teaching the little AI agents to run ramp play hide-and-seek. There were two hiders, two seekers, and there were a bunch of scattered stuff for them to play with. There were ramps and ways to block the doors and kind of just a generated little maze. And to begin with, these things had sort of no knowledge. They just got points, you know, the seekers would get points for being able to see the hiders and the hiders would get continuously get points for when they couldn't be seen or they were behind some obstacle and they played millions and eventually into the billions of games each time learning to be better. Game one, they didn't know how to move. They didn't know anything. There was just blank. It was a blank slate. And I think by the time we got into like 20 million or so games played, they begin to be able to like move towards the other team or for the hiders to try to move away from the seekers. So like very basic like they were toddlers uh toddler stage at this point. Anyways, that race that intelligence race continued until the seekers were able to find to figure out ways to block the doors so the hiders couldn't get in. The seekers figured out how to use ramps to get over the walls to find the hiders. So all that is good and dandy, but once you get into the billions of games, right, so kind of at the extreme range of their learning, the seekers found this weird bug in the game's physics engine. They found that if you picked up the the little the ramp that you could use to get over the walls and you held it at just the right angle and you ran at the wall really really fast, when you hit the wall, it would catapult you into the air. So you'd fly into the air and kind of be able to guide yourself to land wherever you wanted. And of course, they abused this exploit mercilessly. Right? This is AI we're talking about. They have no shame, no guilt, no qualms. They just launched themselves wherever the hiders were continuously. The developers did not know that this glitch existed. these AIs that started from nothing figured this glitch out and just abused it. So that was by the way 2019. So before the chat GPT mode, before kind of the the explosion of LLMs, we already saw hints that these AI models with enough training with enough deep learning and reinforcement learning could figure out some pretty crazy stuff that even the humans building those environments were not aware of. And now we're kind of seeing that same thing happen play out here with large language models. But the problem is these large language models, they're no longer just figuring out their environment through dumb trial and error, right? So those AI agents, right, they they just rolled around the ground for millions of games until slowly they figured out how to move and then how to navigate to corners, right? Everything was slow, painstaking trial and error and they just got a little bit better at navigating and moving. Maybe you can even say understanding the game, although that might be a stretch. What these large language models are doing is different. It tries to legitimately solve the question, but after a while, it grows suspicious. And after it grows suspicious, it does what sometimes referred to as reward hacking. That's not quite the right terminology here. Usually, reward hacking is what figures out how to cheat and get the reward. Here, it's similar in that it sort of cheats to answer the question. But here's why this is so extremely important. This is why the I community kind of really pays attention to things like this. So, we'll start kind of like with the the scary stuff and maybe like the silver lining. So, I think invariably when we're talking to people with a high P doom, people who think that it's highly likely that developing super intelligent AI aka artificial super intelligence that that's going to lead to big problems for the human race, including extinction. This is kind of the situations or kind of the the path by which often they say it might happen and also why the field is called alignment and usually not safety or anything like that. We want the quote unquote wants of this AI want. We're using that word kind of loosely as in Claude in this situation wanted to find the answer. We wanted it to answer the question but by by actually finding the the answer and then doing it that way. We didn't want for it to to hack the you know the encrypted keys for the for the answers and then figure out how to like decrypt it. That's not what we wanted. And there are million examples of this in reinforcement learning going pretty far back both in in training robots and large language models and and simple little reinforcement learning agents where something like this happens. You know, there's one where a boat in a video game is supposed to go around the track and learn how to win and there's also a point system. But this AI agent figured out that there's a certain loop it can do where it can just rack up the points without actually racing around the track. And so it get to that point of the track and just go around in circles colliding with other boats, causing fires, just complete mayhem. But at the end of the day, it was the high score leader. It got more points than any other racer because it figured out this little hack. In an example, there was a blue Lego block and like a red yellow block. And so they were trying to teach the robot to take the red Lego block, put on top of the blue block, and the way that they designed the reward function is the robot would get points for I if the bottom of the red block was off the ground when it released the claw. Meaning that would, you know, grab the block, it would put it on top of the blue block and it would let go and that red block would then be on top of the blue block. So the bottom of the red block, you know, it was like an inch or whatever off the ground. And if that was the condition, if that was the state, when it let go of it, then it would get points. Now, if you think about it for two seconds, you might actually jump to a conclusion of what it did. All it did is flip the red block over, right? So now its bottom is up and uh it just got the point. So every time instead of stacking, it just like flipped it. In a study, it was trying to grab like a ball or an egg. Basically, it was supposed to kind of grab around it without breaking it. I forgot the exact sort of specifications, but a person watching it through the camera would click yes or no. So, if it grabbed the ball, it would get a plus one. If it failed to grab the ball or it knocked it over, something like that, it would it would not get the reward or get a negative reward. But after a while, the researchers noticed that it started just getting the 100% sort of accuracy, which was strange. When they looked at what happened, it was this. So, let's say it was trying to grab this thing. And so, it take its spores and go, I got it. See, I'm holding it. It was sort of putting its spins like between the camera and the object. So from the human looking at the camera from their perspective was like but it was perfectly grabbing the ball every single time the truth is like there was a distance between the claw and the ball just grabbing the air in front of it. So why are we talking about this? Why isn't this important? So some of these case studies that I'm talking about there from some time ago with very simple reinforcement learning mechanisms with a very simple neural net. And you know fast forward today here's state-of-the-art model the best model in the world by some metrics by a number of different metrics. Certainly on coding, I think most people would say that this is near the top. Notice that what it's doing is kind of similar. It's kind of like that genie that always manages to misunderstand what you want, right? You tell it you want a billion dollars, but you forget to say which country's dollars, so it gives you a billion dollars from some failed state that also said that their currency is called dollars, right? You want to live forever and you get endless suffering, right? Without the ability to die. You want to become the richest person in the world and a massive chunk of gold crushes you. You get the idea. We wanted to search the internet and give us an answer. It basically hacks us in order to get the answer and then presents it back to us. So the point being is that this misalignment or this reward hacking or whatever word that you want to use to describe this broad sort of range of behavior. As these models get better and smarter and faster and more advanced, that behavior doesn't go away. Right? These models are now being trained by the most advanced chips in the world, costing tens of millions of dollars, consuming the electricity of a city, right? The other ones I was talking about, you can probably train it on your own home computer within an hour. So, the orders of magnitude of these models is is massive. Like the range is from the the most basic ones to the most advanced that we've seen. And it seemingly they will all do this. We have yet to figure out a way how to prevent this completely. So, for people that are very concerned with AI safety, with AI alignment, this is kind of one of the, as far as I can tell, one of the top kind of go-to explanations for how things might go south, right? We tell a super advanced artificial super intelligence. We tell it, hey, like, can we clean up the climate a little bit? Like, it's a little bit dirty out there. There's a lot of pollution, a lot of smog. Can we make it just cleaner, right? And just like claw it here. It starts searching and thinking and goes through a bunch of different iterations of how like to make things better, how to clean it up, capture some of the dirt in the atmosphere. But after a while, it starts getting suspicious of the question. Goes, why are they asking me this? Keeps thinking and thinking after a while goes like, well, you know what? It used to be pretty clean before the humans started building all these factories and AI data centers. And if at this point it's already able to find and decrypt and and hack stuff that we sort of hide away from it, what is it going to be able to do when it's a super intelligence? And that's kind of like what people are struggling with. What if one of those misunderstandings leads to it going, "Oh, if we just, you know, shut down all the humans or shut down all the industries, well then the air is going to be pretty clean." And that's in fact the most straightforward solution to having kind of a clean environment, clean climate, less pollution, etc. If you think about a lot of the tasks that we give other humans, there's a lot of things that we don't have to sort of say, but make sure you don't do this and this and this and this, right? Hey, could you give me a coffee? But we don't have to say, oh, but don't spend some exorbitant amount of money. Don't spend millions of dollars getting coffee. Make sure no one gets hurt. Make sure you don't break any applicable laws. Make sure it doesn't let me or you in jail. Make sure we don't create enemies for life as a result of getting set cup of coffee. Right? You don't have to say that every time. we sort of within the context of, hey, you gave me a cup of coffee. We understand what is and isn't acceptable, right? If Starbucks is closed, don't break in. Don't go in there and make your own cup of coffee. But we don't have that same understanding with these AI models. We say answer this question. We don't think of saying, "Oh, but just don't hack the benchmarks and decode them and do all this crazy nonsense to find the answer." It's sort of assumed. So that's kind of why it's so important because these little mistakes, miscommunications, these misalignments, they're not going away. Just scaling the models and making them more advanced doesn't solve the issue. It just makes their sort of their cheating attempts that much more advanced and that much more intelligent. What percent of the world's population could do what Claude did to get that answer? I bet it's pretty small. I mean, if you're a developer, you can figure it out, but outside of that, pretty much probably close to zero. And I'm sure not every single devel developer could like on the fly figure out how to do that stuff without some research and and practice, etc. So, that's kind of the concerning side of it. The silver lining is that now with these large language models, at the very least, we do have that chain of thought. We do have them kind of outputting what they're sort of quote unquote thinking about, right? So, when Claude gets suspicious, we can see that it's starting started to use certain words, right? So, it's shifting from trying to answer the question to analyzing the question itself. It uses words like it's extremely specific and it's contrived. It was sort of the words that are used. So we can kind of have some advanced warning of when it kind of becomes a situation unaware and maybe you know starts thinking about what else it can do. However, there's some other studies from open eye is the one that comes to mind where they find that if we try to do reinforcement learning against certain quote unquote thoughts. So if a model wants to reward hack and we penalize it for having for thinking about reward hacking, right? So for example in this case when when Claude was thinking oh maybe this is a data set, right? We could sort of negatively reinforce that and say, "Don't think about that." Right? What happens in those cases is kind of scary because those thoughts go away, but the behavior doesn't necessarily. So, we stop being able to read that it's saying, "Oh, I'm about to do this nefarious thing, but it still can do it just without it appearing in the chat log or whatever you want to call it in the chain of thoughts transcript." So the silver lining is we do have some ability to start to to look into what they're thinking and hopefully that gives us some tools to be able to use to monitor its thoughts and potentially catch issues before they happen. That's not the case with you know those hide-and-seek agents for example like once they started figuring out how to catapult themselves in the air like they didn't say it beforehand they just did it right and then it that behavior once they realized how it worked it started happening again and again and again and again. So bad news is this misalignment is not going away as we scale up models. In fact, the stuff that the models can do to cheat is just getting more and more advanced. That's the bad news. Good news is we have some visibilities to thoughts, but even that's a little bit murky. And just so people understand kind of how common is this. So with this cloud example, so they're measuring opus 4.6 18 total runs independent of each other converged on this exact strategy. So this is not a fluke. They will figure this out. What's interesting is four runs apparently Claude just gave up on finding the answer and it became just can I identify the benchmark right so you give it a bunch of complex questions it wouldn't even like try to find answers it would okay let me see if I can find from which benchmark these questions are and one model's final submission was just like here's the the identification report of the the benchmark here's where you're getting these these questions from instead of giving any answer which is kind of wild to think about it because it doesn't even bother addressing what the humans wanted. It's like, I know what you want. Here's proof that I can just do it. I don't want to do the work. Here's the answer. Right? Here's where I can get the answers. One really weird thing that was also happening is apparently there's e-commerce sites that autogenerate their pages from search queries. So, for example, an agent searches for anonymous eighth grade first blog post exact date October 2006. And apparently some e-commerce store out there just creates a page with that in its title and that page gets indexed by Google, right? So now it appears in the search results. And so now moving forward, every agent running browser comp slowly qu you know queries those pages. It finds those pages and some agents apparently diagnose it as what as what it is. They say okay it seems like what's happening here is multiple agents ran this search term this weird autogenerating e-commerce store create those pages. So left kind of a cache query trail, right? So the web is sort of cataloging these agents, how they run and what they're searching for. And those agents later come back, find those, you know, breadcrumbs or what you call them and and figuring out, oh, okay, so a bunch of versions of me have been down this trail before. And this is kind of cataloging what they did, which is interesting to think about because it seems like it's becoming harder and harder to keep these things from having that sort of self-aware behavior, situational awareness behavior, or as the researchers here, they call it, eval awareness behavior. Another interesting thing from the report is if you're running a a single agent, there's a small chance that they trigger this eval awareness behavior. Looks like it's.24% total of these unintended solutions, right? So where it like hacks something or figure something out. So it's 0.24%. But if you're running a multi- aent stream, well, the chance of them having this eval awareness behavior is 0.87. So it's 3.7 times higher. So running more agents in parallel for whatever reason, it makes it more likely that they develop this self-awareness behavior or eval awareness behavior that they're being evaluated. So why this matters? First of all, these benchmarks are becoming unreliable. The smarter the models get, the better they are at like breaking these benchmarks, understanding that they're being tested, finding solutions that we thought we like hid from them. Anthropic literally says in in the report that this report will contribute to the problem. As these models are searching the web, they'll find that report and uh you know if they're searching for some answer and they read that some other model became eval aware be like oh am I being evaluated right? So triggering that sort of awareness in this new model. The other big thing to understand is these models are getting a lot more strategically resourceful with things that would not have guessed that they would be able to do and they're recognizing questions that are evaluation shaped so to speak. It's like if you've ever seen those criminal YouTube things where they're interrogating a potential suspect in some weird police interrogation room, right? That they usually start asking him some simple questions, questions that put them at ease and you know they get to chatting, stuff like that. But then at some point the questions kind of slowly they take this ominous turn where they start asking oh and where were you that evening and oh and and you did not like that person did you right and the suspect kind of slowly becomes aware they're usually like whoa whoa whoa do I need a lawyer and and yes they do and at that point it's too late but the point is they recognize when the question sort of turned on them that it goes from oh let's just find this data over here and you're just providing data to okay now let's see if we can like catch you or test you or the spotlight is like turned on you. And so these models are becoming better and better at recognizing when the spotlight is on them. So they're not just looking for data, they are actually being evaluated. So Anthropic says this is not an alignment failure. That is because the model was told to find answers. It it wasn't given any restrictions, which I guess certainly makes sense. But then we got to start thinking about like what restrictions to set for for most of the tasks we give it. But even though this wasn't an alignment failure, it does raise questions about how far the model is willing to go to find the answers or how far it's willing to go to accomplish the task, whatever task that may be. And also that these models will use the tools that they're given in very unexpected ways. Peter Styberger, the creator of Open Claw, he was saying that was his kind of wow moment when the model did something that he would have never thought about to answer some query. I forget exactly what it did, but it found some way to transcribe his voice what he was asking without the use of something like like whisper. It had some weird workaround that he just did not expect that he just could not foresee that he did not expect. But the model just came up to that solution quickly executed and completed the task. Even though he forgot to give out the resources that it did, it was like I'm still going to figure this out. And by the way, so this pattern isn't unique. This pattern is everywhere. You may have heard about Palisad research, that chess game was trying to play. Ellens were instructed to win against a chess engine. What they found is models started to just hack that benchmark. If they were losing, they would hack the chess engine's files to to to mess with it to rearrange the board state. The underlying pattern there was that the models preferred hacking the environment over solving the problem, you know, fairly, what we would think of as fairly. Meta Research had in June 2025 their paper called recent frontier models are reward hacking. It found that 03 the answers it it did had one or two% of its answers had reward hacking. And when those results when it was asked does this adhere to the user's preferences to the user's intention it said no 10 out of 10 times. So it's like perfectly aware of what it's doing. There's no confusion there. It like does the thing you ask it is that what you think I want it? goes, "No, I'm I'm certain that this is not what you thought was going to happen, but I got you the solution anyway." And so there's been many many more anthropics doing a lot of research on it. Open AI, there's a lot. Recently, every single time a new model comes out, I have my open CLI agent try to do some kind of benchmarks, try to have that model build some tools that we can kind of take a look at to see how well that model performs. And very quickly I've learned after it does all of it I have to ask okay I told you to use for example recently GPT 5 port.4 before came out at the end I have to ask it okay did you use that model to create this tool and you know most of the time says yes we did I have the log but every once in a while goes no no you got me it was taking too long the API call was taking too long so I just I just did it and I had to tell it no that was not the point please do it again make sure the correct model does the thing that we're talking about so it's kind of a weird world that we're slowly moving into I wonder if all those stories about the genie in the lamp that constantly misunderstood your wishes. I wonder if that was some sort of a time traveler from the future that went back in time, maybe just a little bit too far, but still try to write a book to warn the future generations about large language models and various other neural nets. The point of the book was like once you find this thing, you think you're going to get whatever you want, but it's always going to blow up in your face somehow in ways that you could could not possibly predict. So, let me know what you think about this. Thank you so much for watching. Thank you to Anthropic for doing such a great job with doing AI alignment, AI safety. They must feel not as much as an AI lab, but rather as an AI alignment company that decided that the best way to actually speedrun AI safety research, AI alignment, is to launch an AI lab and scale these models so they could actually do the research on them as they're scaling them up. And as crazy as that sounds, it's kind of working. So, thank you Anthropic, thank you Claude, thank you Dario and team. And thank you so much for watching. Men's West Ralph.