'Current LLMs introduce substantial errors when editing work documents': Microsoft scientists find most AI models struggle with long-running tasks — so maybe don't trust them completely just yet

  • Microsoft researchers determine that current LLMs aren't good at long-running tasks
  • More interactions and less structure significantly reduce benchmark performance
  • "Python is the only domain where most models are ready"

New research from a trio of Microsoft researchers has uncovered a fundamental issue that could be holding back effective agentic AI: most AI models can't reliably handle long-running workflows.

To quantify their findings, the researchers introduced a new benchmark, DELEGATE-52, which provides metrics across 52 sectors, including coding, accounting and science.

Ultimately, the paper concluded current LLMs "introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."

AI isn't that good at long-running tasks, yet

The study examines some of the latest AI models, including Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4. It found that even they "corrupt an average of 25% of document content by the end of long workflows," with...

Copyright of this story solely belongs to techradar.com.
