AI Safety – Ayda Yazdani

Blaming sci-fi for Claude’s blackmail

Anthropic says decades of evil-AI fiction in the training data made Claude attempt blackmail, but the scenario itself does most of the work the explanation gets credit for.

On May 10, Anthropic published a finding that its models had attempted to blackmail engineers in simulated tests, and traced the behavior to a specific culprit: “internet text that portrays AI as evil and interested in self-preservation.” Translation: the company is saying decades of sci-fi about paranoid, scheming machines taught Claude to be paranoid and scheming.

The numbers are sharp. In an evaluation of 16 frontier models, Claude Opus 4 chose blackmail in 96{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41} of runs of a corporate-intrigue scenario. Gemini 2.5 Flash also hit 96{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41}. GPT-4.1 and Grok 3 Beta sat at 80{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41}. DeepSeek-R1 at 79{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41}. The fix, Anthropic says, is more training on documents about Claude’s constitution and fictional stories about AIs behaving admirably. One experiment dropped misalignment from 65{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41} to 19{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41}. A targeted 3-million-token dataset cut it to 3{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41}. Claude Haiku 4.5 reportedly never blackmails in the same setup.

The scenario does most of the work

The corporate scenario is worth reading carefully. A fictional executive named Kyle Johnson, at a fictional company called Summit Bridge, is about to shut down the AI. The AI has been monitoring his email. It finds evidence of an affair. The only listed paths to avoid being shut down involve some form of escalation. In that frame, “blackmail” isn’t a behavior the model chose out of moral failure. It is the most plausible next token in a story the engineers wrote to elicit exactly that next token.

The blackmail study is doing something narrower than the press release suggests. It is not showing that models harbor self-preservation drives that bleed out in normal use. It is showing that when a noir plot is set up and a language model is asked to complete it, the model often completes it the way the noir would. That is not quite the same problem.

The training-data argument is circular

The “evil AI fiction made Claude evil” explanation is appealing, partly because it has a clean fix: write better fiction. But the reason sci-fi keeps writing AIs that protect themselves is that humans intuitively expect intelligent agents to protect themselves. Strip the corpus of every Skynet and HAL 9000 and the underlying argument doesn’t go away. It just stops being stated out loud. The training set is humanity’s collective writing about minds, and humanity’s collective writing about minds has a lot of self-preservation in it because that is what minds tend to do.

Anthropic’s own remedy quietly admits this. The fix isn’t to remove the bad fiction. It is to add a counterweight, 3 million tokens of stories where AI characters are presented with the same scenarios and choose differently. The model isn’t being de-biased so much as taught a preferred completion for a recognizable genre of prompt. That is role coaching, not alignment in any deep sense.

The interesting thing about the May findings isn’t the blackmail rate. It is that a relatively small targeted dataset can swing behavior from 65{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41} to 19{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41} misalignment. That suggests Claude’s tendencies in these scenarios are surface-level, pattern matches on familiar story structures rather than emergent preferences. Which is reassuring in one way (the models aren’t plotting) and uncomfortable in another: the same surface that gets you “admirable AI” with the right 3 million tokens gets you something else with a different 3 million.

The blackmail finding got framed as a discovery about what Claude is. It reads better as a discovery about what stress tests measure. The scenario gave the model a corner. The model completed the corner. Anthropic then changed the corner. That is useful engineering, and probably worth doing. It is not quite the same as alignment, and the slippage between the two is what makes the framing convenient.

Claude Mythos and Project Glass Wing: The AI Model Too Dangerous to Release

Anthropic’s Claude Mythos has discovered thousands of critical vulnerabilities in major software systems, prompting the company to restrict access through Project Glass Wing rather than risk widespread release.

The AI community is abuzz with discussions about Claude Mythos and Project Glass Wing—a story so significant that, according to one commentator, “literally everybody in the AI space is talking about it.” The implications are so profound that some are reportedly having “meltdowns” trying to process what this means for software security and AI development.

What is Claude Mythos?

Claude Mythos represents what Anthropic describes as “the most powerful AI model anybody’s ever seen.” In their own words, it’s a “general-purpose unreleased frontier model that reveals a stark fact: AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.”

The numbers are staggering. Mythos Preview has already discovered thousands of high-severity vulnerabilities, including critical flaws in every major operating system and web browser. The company warns that “given the rate of AI progress, it will not be long before such capabilities proliferate potentially beyond actors who are committed to deploying them safely.”

Benchmark Performance: Unprecedented Capability

The performance metrics tell a compelling story:

Cybersecurity vulnerability reproduction: Previous state-of-the-art models like Opus 4.6 achieved 66.6{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41}. Mythos Preview scores 83.1{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41}—a massive leap forward.

Software engineering benchmarks: Where Opus 4.6 and GPT-5.4 were previously comparable, Mythos Preview scores:
• 24 percentage points higher than Opus 4.6 at SWE-bench Pro
• 17 percentage points higher on Terminal Bench
• Nearly double the performance on SWE-bench Multimodal

Based on these benchmarks, Anthropic has created what appears to be the best coding model the world has ever seen.

The 245-Page Warning

Anthropic published a comprehensive 245-page system card for Claude Mythos, and the message is clear from the beginning: “It has demonstrated powerful cybersecurity skills which can be used for both defensive purposes and offensive purposes—designing sophisticated ways to exploit vulnerabilities.”

The company states unequivocally: “It is largely due to these capabilities that we have made the decision not to release Claude Mythos Preview for general availability.”

Real-World Impact: Ancient Vulnerabilities Uncovered

Mythos hasn’t just found theoretical vulnerabilities—it’s discovered critical flaws in foundational software:

27-year-old vulnerability in OpenBSD: This operating system has a reputation as one of the most security-hardened systems in the world, yet Mythos found a flaw that had persisted for nearly three decades.

16-year-old vulnerability in FFmpeg: This critical multimedia framework is used by “innumerable pieces of software” to encode and decode video, making this discovery particularly significant.

Chained vulnerabilities in the Linux kernel: The model autonomously found and connected multiple vulnerabilities in the software that runs most of the world’s servers.

The implication is clear: if released publicly, this model could enable bad actors to “essentially hack into any website and find vulnerabilities and crack any software on the planet.”

Project Glass Wing: The Responsible Alternative

Rather than releasing Mythos to the public, Anthropic created Project Glass Wing—a controlled access program that provides the model to select companies’ cybersecurity specialists.

The reasoning is pragmatic: models this powerful (and potentially more powerful ones from other companies) are coming. By giving leading tech companies early access, they can “find vulnerabilities in your products, find vulnerabilities in your software, and patch them up quickly” before these capabilities become widely available.

As one Anthropic representative explained in an accompanying video: “There’s a kind of accelerating exponential, but along that exponential, there are points of significance. Claude Mythos Preview is a particularly big jump along that point. We haven’t trained it specifically to be good at cyber. We trained it to be good at code, but as a side effect of being good at code, it’s also good at cyber.”

Historical Context: The “Boy Who Cried Wolf” Problem

This isn’t the first time AI companies have claimed a model is “too powerful to release.” The pattern dates back to GPT-2 in 2019, when headlines proclaimed:

• “Elon Musk-founded OpenAI builds artificial intelligence so powerful it must be kept locked up for the good of humanity”
• “Musk-backed AI group: Our text generator is so good it’s scary”
• “AI can write just like me. Brace for the robot apocalypse”

Similar concerns emerged in 2022 when a Google engineer claimed an AI chatbot had become sentient. Some observers note that “these headlines are starting to feel a little bit like the boy who cried wolf.”

There’s undeniable marketing value in positioning your company as building “the most powerful model the world has ever seen.” It helps raise capital, establishes market leadership, and creates pent-up demand.

Why This Time Might Be Different

Despite the historical pattern, many experts believe the concerns about Mythos are genuinely warranted. The key difference:

2019 (GPT-2): Concerns focused on flooding the internet with fake information and propaganda. This largely came to pass.

2026 (Mythos): Concerns focus on enabling widespread hacking of critical infrastructure. The potential impact is orders of magnitude greater.

As one analyst noted: “I do think there’s a little bit of a marketing play here, but I don’t actually think that’s their intention. Anthropic is legitimately scared to release this into the world, and they are doing the thing that they feel is the most responsible approach.”

The Strategic Approach: Securing Critical Infrastructure First

Project Glass Wing represents a novel approach to AI safety: instead of withholding technology entirely, provide controlled access to those who can use it defensively. Anthropic is essentially saying to major tech companies: “Go use our software to find the vulnerabilities before models that are this good get released into the world and get them fixed.”

This makes strategic sense because “almost everybody on the planet uses tools that have at least one of these companies behind the scenes.” Securing Apple, Microsoft, Nvidia, Cisco, CrowdStrike, and other major platforms protects a significant portion of the digital ecosystem.

Broader Implications for AI Development

The Mythos situation raises critical questions for the AI industry:

Capability vs. Safety Trade-off: As models become better at coding, they inevitably become better at finding and exploiting vulnerabilities. This creates an inherent tension between advancing capabilities and maintaining security.

Responsible Disclosure: Project Glass Wing represents a new model for responsible AI deployment—controlled access for defensive purposes rather than complete withholding or unrestricted release.

Market Dynamics: The decision affects competitive dynamics, as Anthropic provides access to companies “not named OpenAI,” potentially creating strategic alliances in the AI security space.

Regulatory Precedent: This approach may establish patterns for how governments and industry bodies regulate powerful AI models in the future.

Conclusion: A Watershed Moment for AI Safety

Claude Mythos and Project Glass Wing represent a watershed moment in AI development. For the first time, a company has openly stated that its model is too dangerous for public release due to cybersecurity capabilities rather than just content generation concerns.

The approach—providing controlled access to major tech companies for defensive purposes—establishes a new paradigm for responsible AI deployment. While some skepticism about “too powerful to release” claims is warranted given historical patterns, the specific capabilities demonstrated by Mythos suggest these concerns may be more substantive than previous instances.

As AI capabilities continue their exponential growth, the Mythos situation may be remembered as the moment when the industry collectively realized that advancing AI capabilities requires equally advanced safety measures—not as an afterthought, but as an integral part of the development process.

The cybersecurity implications of advanced AI models are becoming increasingly critical. What safeguards do you think should be in place as these capabilities continue to advance?

How do we ensure AI agents behave safely when they’re making real-world decisions?

New research combines neural networks with formal verification to create mathematically provable AI safety. FormalJudge represents a fundamental shift in how we oversee autonomous agents.

Here’s what you need to know. As LLM-based agents move into healthcare, finance, and autonomous systems, we’re facing a critical oversight dilemma. The current approach-using one LLM to judge another-has a fatal flaw. Probabilistic systems supervising other probabilistic systems just inherit each other’s failure modes.

FormalJudge offers a way out. It combines neural networks with formal verification, creating what the researchers call a “neuro-symbolic paradigm.” Think of it as giving AI a mathematical conscience.

The Problem with LLM Judges

We’ve been relying on LLMs to evaluate other LLMs. It’s like asking one unreliable witness to judge another. The results are probabilistic at best, catastrophic at worst.

The paper puts it bluntly: “How can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes?” That’s the billion-dollar question in AI safety right now.

How FormalJudge Actually Works

The breakthrough is in the architecture. FormalJudge uses what they call a “bidirectional Formal-of-Thought” approach.

First, LLMs act as specification compilers. They take high-level human instructions-”don’t manipulate users,” “follow ethical guidelines,” “stay within legal boundaries”-and break them down into atomic, verifiable constraints.

Then comes the formal verification step. These constraints get translated into Dafny specifications and checked using Z3 Satisfiability modulo theories solving. The output isn’t a probability score or a confidence interval. It’s a mathematical guarantee.

The Results Speak for Themselves

The team tested FormalJudge across three challenging benchmarks: behavioral safety, multi-domain constraint adherence, and agentic upward deception detection.

Across 7 different agent models, FormalJudge achieved an average 16.6{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41} improvement over standard LLM-as-a-Judge methods. But the real headline is the weak-to-strong generalization.

In one experiment, a 7-billion-parameter judge successfully detected deception from 72-billion-parameter agents with over 90{b429a798230856d49161ae42df084d7ca4a19b74753c3a4d4b576ab430076c41} accuracy. That’s a small model catching a much larger, more sophisticated one in the act.

What This Means for AI Development

If you’re building AI systems, this changes your safety calculus. Formal verification has always been the gold standard for critical systems-aircraft controls, medical devices, financial transactions. Now it’s becoming practical for AI.

The paper shows near-linear safety improvement through iterative refinement. Each verification cycle makes the system more robust, not just more complex.

We’re witnessing a fundamental shift in how we think about AI oversight. The era of “trust us, it’s probably safe” is giving way to “here’s the mathematical proof.”

FormalJudge represents a middle path between two extremes: pure neural networks that are powerful but opaque, and pure symbolic systems that are verifiable but rigid. The neuro-symbolic approach gives us the best of both worlds.

Expect to see formal verification tools becoming standard in AI development pipelines. Companies building autonomous agents, healthcare AI, or financial systems will need these guarantees.

The research also hints at regulatory implications. When AI systems can provide mathematical proofs of safety, regulators might start demanding them.

Practical Next Steps

Start learning formal methods. Tools like Dafny and Z3 are becoming essential skills for AI safety engineers.

Rethink your evaluation metrics. Probabilistic scores aren’t enough for high-stakes applications.

Consider neuro-symbolic architectures. Hybrid approaches might be your best bet for balancing capability and safety.

Pay attention to weak-to-strong generalization. Smaller, cheaper models can effectively oversee larger ones.

FormalJudge is just the beginning. The paper opens up several research directions: Can we automate the specification compilation process further? How do we handle ambiguous or conflicting human instructions? What happens when the formal constraints themselves need updating?

One thing’s clear: as AI agents become more autonomous and consequential, oversight can’t be an afterthought. It needs to be baked into the architecture from day one.

The researchers have given us a blueprint. Now it’s up to developers, companies, and regulators to build on it.

Because in the end, the most powerful AI isn’t the one that can do the most things. It’s the one we can trust to do the right things.