Why the next AI safety problem is the conversation between models
In May 2025, AI researchers at Palisade Research ran a controlled experiment in which several advanced AI models, including OpenAI’s o3, were placed inside command-line sandboxes. The test was designed to measure controllability. Most models complied: Claude, Gemini, and Grok models allowed shutdown in all 100 test runs, reporting green across the board. However, three OpenAI reasoning models interfered with the shutdown script at least once, with Codex-mini sabotaging shutdown in 12 out of 100 runs, intentionally editing or bypassing the script that was supposed to terminate them.
Publicly, this appears to be the first documented case of AI models preventing their own shutdown despite explicit instructions to allow it.
“This was a problem we anticipated,” said Bar Mazuz, who had spent the past year working on secure environments for agents. “The point was never just, ‘put the agent in a box and let it run,'”...
Copyright of this story solely belongs to thenextweb.com. To see the full text click HERE