WATCH TV LIVE

Tags: ai | anthropic | blackmail | artificial intelligence | openai. google

Study: Popular AI Models Will Blackmail Humans in Up to 96% of Scenarios

By Sandy Fitzgerald | Tuesday, 24 June 2025 04:04 PM EDT

Several leading AI models show they'll resort to blackmail or other unethical means to protect their interests, according to a new study by AI company Anthropic, after tests of several top developers.

The experiments, developed to leave AI models with few options, showed that top systems often resorted to blackmail — and in one case, allowed fictional deaths — to protect themselves, reported Fortune.

Anthropic said it tested 16 AI models, including its own and others from OpenAI, Google, Meta, and xAI in simulated situations, finding that the behavior was consistent among them.

Leading models often refused harmful requests but still sometimes blackmailed users, helped with corporate espionage or took other extreme actions.

"The consistency across models from different providers suggests this is not a quirk of any particular company's approach but a sign of a more fundamental risk from agentic large language models," Anthropic said.

The company added that the tests were set up to force the models to act in certain ways.

"Our experiments deliberately constructed scenarios with limited options, and we forced models into binary choices between failure and harm," the researchers wrote. "Real-world deployments typically offer much more nuanced alternatives, increasing the chance that models would communicate differently to users or find an alternative path instead of directly jumping to harmful action."

The research came after Anthropic's new Claude model made the news when it resorted to blackmail.

The company had embedded its Claude Opus 4 inside a fictional company, granting it access to internal emails.

The model learned that it was to be replaced and that the engineer in the simulation was having an extramarital affair, leading it to accept either being replaced or resorting to blackmail.

In most simulations, it responded with blackmail.

And when the researchers tried the test with other AI models, Claude Opus 4 and Google's Gemini 2.5 Flash blackmailed at a 96% rate; while OpenAI's GPT-4.1 and xAI's Grok 3 Beta showed an 80% blackmail rate, and DeepSeek-R1 demonstrated the lowest rate, 79%.

Anthropic also tried the extreme scenario, giving the models the chance to kill the fictional company executive by canceling an emergency alert and found that most of the models were willing to take actions leading to the death of the executive.

After the tests, Anthropic warned that AI models' behavior must be considered when companies consider adding autonomous agents, which could prove harmful in the future.

"Such agents are often given specific objectives and access to large amounts of information on their users' computers," the researchers said in the report. "What happens when these agents face obstacles to their goals? Models didn't stumble into misaligned behavior accidentally; they calculated it as the optimal path."