'Current LLMs introduce substantial errors when editing work documents': Microsoft scientists find most AI models struggle with long-running tasks — so maybe don't trust them completely just yet
- Microsoft researchers determine that current LLMs aren't good at long-running tasks
- More interactions and less structure significantly reduce benchmark performance
- "Python is the only domain where most models are ready"
New research from a trio of Microsoft researchers has uncovered a fundamental issue that could be blocking effective agentic AI: most AI models cannot reliably handle long-running workflows.
To quantify their findings, the researchers introduced a new benchmark, DELEGATE-52, which provides metrics across 52 sectors, including coding, accounting and science.
Ultimately, the paper concluded current LLMs "introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."
AI isn't that good at long-running tasks, yet
The study examined several of the latest AI models, including Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4. It found that even they "corrupt an average of 25% of document content by the end of long workflows," with...
Copyright of this story solely belongs to techradar.com.