Signal Room / Editorial

Wes Roth · Civilisational risk and strategy · Spotlight · Released: 6 Mar 2026

GPT 5.4 "we see no wall"

Why this matters

Auto-discovered candidate. Editorial positioning to be finalized.

Summary

Auto-discovered from Wes Roth. Editorial summary pending review.

Perspective map

Mixed · Governance · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).


Across 14 full-transcript segments: median 0 · mean -1 · spread -102 (p10–p90 -90) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.

Slice bands
14 slices · p10–p90 -90

Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes governance
  • Emphasizes safety
  • Full transcript scored in 14 sequential slices (median slice 0).
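For readers wondering where the headline slice statistics come from: a minimal sketch, assuming the median, mean, and p10–p90 figures are ordinary summary statistics over the 14 per-slice scores. The slice values below are placeholders, not this item's real data, and the page's exact "spread" convention is not documented here.

    # Minimal sketch: summary statistics over hypothetical per-slice scores.
    # The 14 values are placeholders, not this library item's actual data.
    import numpy as np

    slice_scores = np.array([0, -5, 3, -12, 0, 8, -40, 0, 2, -1, -50, 40, 30, 11])

    print("median:", np.median(slice_scores))            # headline "median" figure
    print("mean:  ", round(float(slice_scores.mean())))  # headline "mean" figure
    p10, p90 = np.percentile(slice_scores, [10, 90])
    print("p10:", p10, "p90:", p90)                      # the p10-p90 band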

Editor note

Auto-ingested from daily feed check. Review for editorial curation.

ai-safety · wes-roth


Episode transcript

YouTube captions (auto or uploaded) · video 9zVZVtPMU6Y · stored Apr 2, 2026 · 381 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/gpt-5-4-we-see-no-wall.json when you have a listen-based summary.

As today is winding down, there's some insane AI news. First and foremost, GPT 5.4 gets released. It seems like it's going to be pretty good at potentially replacing people. It also apparently has native, built-in computer use capabilities; pretty sure we've never seen that before, not natively within the model. Noam Brown is saying this about AI progress: "We see no wall. It's getting kind of scary good at economically valuable tasks."

And unfortunately, Anthropic has been officially labeled a supply chain risk. The thing that I think we were all hoping wouldn't happen, that it wouldn't come to this: it did happen. It's official. And Anthropic is saying they are going to challenge it in court. The good news, or some silver lining, if you will, is that this applies only to the use of Claude by customers as a direct part of contracts with the Department of War, not all use of Claude by customers who have such contracts. The department's letter has a narrow scope, and the relevant statute is narrow as well. So I'm very happy to hear that. That means this won't be a massive blow to Anthropic, but they're still going to challenge it, and it still kind of sucks that we're here, because they are now back in negotiations with the government. And I don't know, I kind of was hoping that things were getting better.

One thing that really stands out to me about this new release of GPT 5.4, and there are a couple, really, but the first one I noticed that was really interesting: it has a pretty big jump on GDPval. GDPval takes people that have a lot of experience in some particular industry. I believe on average these are people that have 12 years of experience, management experience, and those experts create a rubric to grade completed final projects that experts in that industry do. So for example, for a manufacturing engineer, here's a prompt and task concept and an experienced human deliverable. So if you're an experienced human, this is an example of the kind of work you would submit to your boss if they asked you to do this particular project. If you're an order clerk, there's a project for you, and this is what the experienced human deliverable would look like. Or for a producer, this is what that particular deliverable would look like. We're not going to dive too deep into it. Basically, deliverables are split-tested to see who can do it better, a human or one of the AI models.

And for a while after its release, the LLMs were not as good as humans. So if you see this line, this is parity with industry experts; that's humans. And this is what the progress has been: the win rate, or win rate plus tie rate, by different models. Again, this was one of the first releases of this benchmark, but since then it has been steadily improving, and now GPT 5.4 Pro and GPT 5.4 are ranking very high on that particular benchmark. So GPT 5.4 Pro is at 82%, meaning a human expert delivers a job, GPT 5.4 Pro delivers that completed job, and both are graded against the rubric designed by these experts in the field. I think it's actually 14 years of experience, people that work for companies like Deloitte, Wells Fargo, Bank of America, Google, etc., etc.
They created rubrics to grade these, and now these new GPT 5.4 models are getting an 82% or 83% rate of winning versus the human, or tying. So 82% of the time it either wins or ties with the human deliverable, and looking at just the win rate, it's right around 70%. So 70% of the time it just wins; it is judged to be better than human expert work. It remains to be seen how this impacts the actual jobs out there. Does this automation replace actual people, actual jobs?

By the way, today's a big day. Today, Anthropic also published "The labor market impacts of AI: a new measure and early evidence." We're probably going to do a separate video on this, because it's a very interesting topic to watch. But the big headline is that while there's no major impact that we're seeing today, or at least not yet, we are seeing less and less hiring in those early years when people are starting their careers straight out of college, in the first few years when they're just building up skills and entering the workforce. We are seeing that area targeted, and job growth really slowing down for those particular individuals. And this is more or less the same thing as the Stanford paper that we covered, finding the same things. The Stanford paper used a lot of the Anthropic data for their research, so it kind of makes sense. But the big thing they're realizing here is that the workplace automation we currently have is a tiny percentage of what's actually possible.

And then of course we have this. So this is for GPT 5.4: it has computer use and vision. Interestingly, this model is the first general-purpose model with native computer use capabilities, and it marks a major step for developers and agents alike. So you can build agents to complete real tasks across websites and software systems, and I'll show you exactly what that looks like in a second; somebody already built something pretty cool using this. It's excellent at writing code to operate computers via libraries like Playwright, as well as issuing mouse and keyboard commands in response to screenshots. Playwright is a browser automation library. So if you want your little AI agent to be able to do various stuff on the web, Playwright is one of the tools, one of the libraries, you might use.

Here's the thing that should stand out to you like a sore thumb: OSWorld-Verified. This is where we measure a model's ability to navigate a desktop environment through screenshots and keyboard and mouse actions, etc. GPT 5.4 achieves a state-of-the-art 75% success rate. And they also give you some comparisons, because what does that mean? Is 75% good? I mean, yeah, it's state-of-the-art, it's the best, there's nothing better than it, but how good is that? Well, GPT 5.2 was at 47%, so that's a big, big leap. It also surpassed human performance, which is at 72.4%. That is kind of wild, isn't it? At this point, it's navigating desktops through screenshots and mouse commands better than humans. This was one of the weakest points for these models for a long, long time.

One interesting thing is that this can be used for troubleshooting various visual applications, like games. So this thing can build a game, then use its newfound computer vision abilities to see if the game plays okay, if there are any graphical issues, whether it can click on all the buttons, etc. And apparently people are already messing around with this and saying that it is indeed working.
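A quick aside on the two GDPval figures above, the roughly 70% win rate and the 82–83% win-or-tie rate: they are just two ratios computed over the same set of graded head-to-head comparisons. A minimal sketch, with fabricated grade counts chosen to land near the cited numbers rather than OpenAI's actual data:

    # Fabricated verdicts, one per model-vs-human pairing; not real GDPval data.
    grades = ["model_wins"] * 70 + ["tie"] * 12 + ["human_wins"] * 18

    wins = grades.count("model_wins")
    ties = grades.count("tie")
    total = len(grades)

    win_rate = wins / total                  # ~0.70: model judged strictly better
    win_or_tie_rate = (wins + ties) / total  # ~0.82: model at least matches the expert

    print(f"win rate:        {win_rate:.0%}")
    print(f"win-or-tie rate: {win_or_tie_rate:.0%}")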
Cory Ching is saying that he built a tactical turn-based RPG with Codex and GPT 5.4, using Playwright for testing and image gen for the visuals. He says, "I grew up loving turn-based RPGs, so this was a fun one to build," and there's a 45-second demo below.

We've talked about this day for quite some time: when will these models be able to actually look at the visual outputs of their code, video games or websites, etc., click and interact with them, and just continuously iterate and improve on them? I can't even count how many times I've typed into Claude or ChatGPT or something like that: it's just a blank screen. Especially with any 3D graphics you're trying to display in a browser, like Three.js. Very often they're like, "Okay, your racing video game is done. Open it in your browser window." You open it, it's just a black screen. You tell it it's just a black screen. It goes, "Oh, you're absolutely right. What was I thinking? Let me fix it. Oh, I found the glitch, I forgot something. It's fixed. Try opening it in your browser window." You open it, it's a blank black screen. So you type in, you know, it's still just a black screen. You've seen me testing these models; you've seen this happen if you've been watching this channel. And that's just a small fraction of the number of times I've actually had to do that. Today, if the legends are true, is the first day of a new era: an era where we no longer have to tell our chatbots that it's just a black screen for the fifth time, that there's just nothing there.

Also, it looks like OpenAI is taking a few pages out of the Anthropic playbook. It looks like they're beginning to support skills, and they're using skills in such a way that you're able to migrate from Anthropic to OpenAI if you wanted to. They also have their very own ChatGPT for Excel. So skills, and using these LLMs for Excel: those are both things being transferred over from Anthropic, some of the things Anthropic did very well being replicated on the OpenAI side.

But that's not it, because as OpenAI is releasing this new flagship AI model, it's also releasing a suite of financial service tools. This is similar to what Anthropic's been doing: they've been releasing a number of skills or tools that help Claude do various stuff in fields like legal, cybersecurity, and many others. I think they had a financial one. So OpenAI, as of today, is doing this as well. Interestingly, they seem to be focusing on the financial industry as the next big target. OpenAI is saying here that after software engineering, finance will see the benefits of model improvements more acutely than any other field. This is from Ryan Brewer, who is building finance things at OpenAI.

There are actually quite a few other things, like a priority mode: if you need your answers faster, there's a way to tap into this priority stream. Is it possible it's running on Cerebras chips? I haven't seen a confirmation of that yet, but it does seem to be a fast lane that's been added. Also, you can interrupt the model midstream to guide it, to give it follow-up directions, to change the course of where it's going.
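The look-at-your-own-output loop described above is easy to sketch with Playwright's Python bindings: render the page, screenshot it, hand the screenshot to a vision-capable model, and apply whatever action comes back. The Playwright calls below are real; ask_model() is a hypothetical stand-in for your model provider's API, and the localhost URL is an assumption:

    # Sketch of a screenshot -> model -> mouse-action loop (pip install playwright).
    from playwright.sync_api import sync_playwright

    def ask_model(screenshot: bytes) -> dict:
        # Hypothetical: send the screenshot to a vision model and get back an
        # action such as {"type": "click", "x": 320, "y": 240} or {"type": "done"}.
        # Wire this to whichever provider API you actually use.
        return {"type": "done"}

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:8000")  # assumed: the game build under test

        for _ in range(10):  # cap iterations so a stuck agent halts
            action = ask_model(page.screenshot())
            if action["type"] == "done":
                break
            if action["type"] == "click":
                page.mouse.click(action["x"], action["y"])

        browser.close()

This is exactly the "it's just a black screen" cycle, minus the human: the model sees the rendered frame instead of being told about it.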
Seems like OpenAI has some sort of an internal investment banking benchmark, and in this release, GPT 5.4 Thinking is the top-scoring one. It's at 87%, or 0.873, with 1 being the best. So call it 87%. GPT 5.2 Pro is at 71%, and they also have Opus 4.6 listed at, call it, 64%. And this benchmark measures real-world finance workflows that often take analysts hours or days to complete, like financial modeling, scenario analysis, data extraction, and long-form research. So it's good at finance, it's good at Excel, and it's got that computer vision system.

And finally, we also have a researcher at OpenAI leaving OpenAI and joining Anthropic. His name is Max Schwarzer. He's a prominent figure. He worked on, for example, GPT-5, everything we're discussing here. He loved helping to create the reasoning paradigm and scaling up test-time compute with [unclear]; that's Noam Brown saying this. And apparently he was part of actually shipping o1-preview, the very first reasoning model that we all got our hands on. He's thanking all the people that he's worked with at OpenAI, including Sam Altman. As far as I can tell, no drama here. He's not throwing shade at anybody, but he's saying that some of the people he trusts and respects most have joined Anthropic over the last couple of years, and he would like to work with them moving forward.

That's kind of the main news, but it's just scratching the surface of what happened today. OpenAI published new research on chain-of-thought controllability, so we're probably going to be looking at that in a separate video. And of course, in the last 48 hours, we also had Gemini 3.1 Flash-Lite landing, aka being released, by Google, and Grok 4.20 is releasing, kind of like the beta 2.

So, I'm off to test this new model, GPT 5.4. I've got to say, I'm pretty excited. This definitely feels like a pretty big leap. Specifically, I want to test out the computer vision, computer use model, or I guess it's not even a model: it's baked in, natively part of this new release. So I want to get my hands on it. As you know, I have my AI agents constantly building and improving the site natural20.com. That's natural20.com, a news aggregator. We also have all of the AI benchmarks updated live. Well, actually, not right this second. Just as I was beginning to record this video, I noticed that all of my AI agents just crashed. They're nonresponsive. I'm not going to lie, I'm very worried. They're all running on the Claude Code Anthropic OAuth thing, so I sure hope they didn't just pull the plug on me. Although I guess this isn't the worst day for that to happen. But if I'm able to get them back online, I do have this demo section where I'm building out some cool things, including with new model releases. I'll try to have a few projects built by that model, so sometime within hopefully the next 24 hours, I'll have the GPT 5.4 section here, with some of the coolest things it's been able to build.

For the last release, the 3.1 Pro demo built the Starlink satellite tracker. I think it's pretty cool. You're able to see in real time all of these Starlink satellites floating through space, you're able to click on them and see which ones they are, altitude, velocity, latitude, longitude, and if you're on mobile and you share your location, you actually see where you are on the map. Also, I'll send my army of drones out there to find you wherever you are.
I'm totally kidding. Probably. But let me know what you think about GPT 5.4. Are you underwhelmed? Are you pretty excited? Does this feel like a pretty big leap forward? A lot more coming soon, as soon as I get my hands on all this and get to experiment and figure it out. So stick around, see you tomorrow. I've got to go resuscitate my AI agents, because I think they've been down for maybe close to an hour, and this is just not how I want to go about living my life: doing work that can be automated, doing it myself like an animal. This is not okay. Let the robots do the work. My name is Wes Roth. See you in the next

Counterbalance on this topic

Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.
