TECH NEWS

'Current LLMs introduce substantial errors when editing work documents': Microsoft scientists find most AI models struggle with long-running tasks — so maybe don't trust them completely just yet

Microsoft researchers determine that current LLMs aren't good at long-running tasks
More interactions and less structure significantly reduce benchmark performance
"Python is the only domain where most models are ready"

New research from a trio of Microsoft researchers has uncovered a fundamental issue that could be blocking effective agentic AI: most AI models can't reliably handle long-running workflows.

To quantify their findings, the researchers introduced a new DELEGATE-52 benchmark to provide metrics across 52 sectors, including coding, accounting, science and more.

Ultimately, the paper concluded current LLMs "introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."

AI isn't that good at long-running tasks, yet

The study examines some of the latest AI models, including Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4. It found that even these "corrupt an average of 25% of document content by the end of long workflows," with...

Copyright of this story solely belongs to techradar.com.

SpaceX S-1: Starlink had 10.3M subscribers in Q1 2026, a 105% increase YoY; SpaceX's “Connectivity” business, which is primarily Starlink, made $11.3B in 2025

I couldn’t figure out how to delete old ChatGPT images from my Library — here’s the hidden method that…

Sometimes it’s the simplest things that confound you with AI. Creating an image in ChatGPT is super simple — just select Create image from the + menu and type in what you want to see. But when it comes to deleting that image from your Library, the delete option only appears

You've heard of Touch ID and Face ID, but is Ear ID next? Researchers have detailed a new tech would let you use AirPods or similar buds to prove who you are and unlock your gadgets — and it's actually your heartrate that they detect

* Chinese researchers have developed 'AccLock'
* This uses your heartbeat to verify your identity
* All it needs is earbuds with accelerometers

Researchers from several universities in China have developed a technology they called AccLock, and it's basically Ear ID. It's a way of verifying your

SpaceX Listed Grok's ‘Spicy’ Mode as a Risk in Its IPO Filing

SpaceX warned investors that AI features such as Grok’s “Spicy” and “Unhinged” modes, which allow the chatbot to generate raunchy image or voice responses with fewer safety filters, could expose the company to regulatory scrutiny and reputational damage, according to a filing submitted Wednesday as part of the
