Three Microsoft researchers, Philippe Laban, Tobias Schnabel, and Jennifer Neville, ran 19 large language models through a benchmark called DELEGATE-52. The setup is simple. Hand the model a document. Ask it to make a structural edit. Ask it to undo that edit. Repeat for ten round trips, which works out to 20 interactions. Then compare the final document to the original and count what was lost.
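The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's harness. The `edit` and `undo` callables stand in for model calls, the two toy functions are hypothetical, and `difflib`'s similarity ratio is an assumed substitute for whatever scoring the researchers actually used.

```python
import difflib
from typing import Callable


def round_trip_retention(
    original: str,
    edit: Callable[[str], str],
    undo: Callable[[str], str],
    round_trips: int = 10,
) -> float:
    """Apply edit then undo `round_trips` times; return similarity to the original (0 to 1)."""
    doc = original
    for _ in range(round_trips):
        doc = edit(doc)  # first interaction of the round trip
        doc = undo(doc)  # second interaction of the round trip
    # Stand-in metric (assumption): difflib's ratio, not the paper's scoring method.
    return difflib.SequenceMatcher(None, original, doc).ratio()


def reverse_lines(text: str) -> str:
    """Toy structural edit: reverse the line order. Applying it twice is lossless."""
    return "\n".join(reversed(text.split("\n")))


def lossy_reverse_lines(text: str) -> str:
    """Hypothetical faulty undo: reverses the lines but silently drops one."""
    lines = text.split("\n")
    if len(lines) <= 1:
        return text
    return "\n".join(reversed(lines[:-1]))


doc = "\n".join(f"line {i}" for i in range(12))
lossless_score = round_trip_retention(doc, reverse_lines, reverse_lines)
lossy_score = round_trip_retention(doc, reverse_lines, lossy_reverse_lines)
print(lossless_score, lossy_score)
```

The lossless pair returns the document unchanged after ten round trips. The lossy pair loses one line per round trip, and no single step looks badly wrong in isolation.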
In a paper covered May 11, the average frontier model (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) had corrupted 25% of the document content by the end of 20 interactions. Across all 19 models tested, the average was closer to 50%. The benchmark covers 52 professional domains, from accounting to music notation to crystallography. Out of those 52, exactly one cleared the readiness threshold the researchers set: 98% accuracy retained. That domain was Python.
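To get a feel for how small the per-step losses behind those totals could be, here is a back-of-the-envelope calculation. It rests on two loud assumptions that do not come from the paper: that "25% corrupted" means 75% of the content survives, and that the loss is spread evenly across all 20 interactions. The function name is made up for illustration.

```python
def per_step_retention(final_retention: float, interactions: int = 20) -> float:
    """Constant per-interaction retention r such that r ** interactions == final_retention.

    Assumption: loss compounds evenly at every step, which is a simplification.
    """
    return final_retention ** (1 / interactions)


frontier = per_step_retention(0.75)  # 25% corrupted after 20 interactions
overall = per_step_retention(0.50)   # roughly 50% corrupted after 20 interactions
print(f"{frontier:.4f} {overall:.4f}")  # prints "0.9857 0.9659"
```

Under that even-loss assumption, a model that keeps about 98.6% of a document at every step still ends 20 steps later with a quarter of it gone.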
Why the language with the strictest syntax held up
Gemini 3.1 Pro, the best performer of the group, passed in 11 of 52 domains. The other 18 models passed in fewer. The Python result is not a hidden detail in the paper; it is the headline finding for anyone reading from a CS classroom. Most LLMs, 17 of the 19 tested, handled lossless Python manipulation across 20 interactions. They did not handle lossless music notation, weaving patterns, EDIFACT, earnings statements, or crystallography logs.
The reason is the part of programming that students often complain about. Python has a syntax checker. The interpreter will refuse to run code that has a misplaced colon or an unclosed bracket. There is no equivalent for music notation. There is no parser that will reject an XML earnings statement with a quietly wrong figure inside it. The model can rewrite a number and nothing in the toolchain will catch it. With Python, the errors that compound silently in other domains crash the program instead. The model gets immediate feedback that what it produced is broken, and produces something else.
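The asymmetry is easy to reproduce with the standard library. The sketch below uses `ast.parse` as the Python syntax check and `xml.etree.ElementTree` as a well-formedness check for XML. The sample documents, the revenue figures, and the helper names are invented for illustration.

```python
import ast
import xml.etree.ElementTree as ET


def python_parses(source: str) -> bool:
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def xml_parses(source: str) -> bool:
    """Return True if the source is well-formed XML (says nothing about its values)."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


good_code = "def total(xs):\n    return sum(xs)\n"
bad_code = "def total(xs)\n    return sum(xs)\n"  # missing colon after the signature

good_xml = "<statement><revenue>1200000</revenue></statement>"
wrong_xml = "<statement><revenue>1250000</revenue></statement>"  # hypothetical altered figure

print(python_parses(good_code), python_parses(bad_code))  # True False
print(xml_parses(good_xml), xml_parses(wrong_xml))        # True True
```

The broken Python is rejected before it runs, while the altered figure passes the XML parser untouched. The check only covers structure, though. A Python file that is syntactically valid but contains a wrong constant would pass `ast.parse` just as quietly.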
Read the other way, the finding is uncomfortable. The reason a CS student trusts a model with a refactor is the same reason a paralegal should not trust one with a contract. The thing keeping Python intact across 20 interactions is not the model. It is the language.
Tools made it worse
The agentic configuration is where the result starts to feel like a rebuke of how the rest of the industry has framed 2026. The researchers ran the same benchmark twice, once with a plain language model and once with the model equipped to read files and execute code. The agentic version did worse, finishing the simulation an average of 6 percentage points behind.
The breakdown the authors give is worth listing:
- Context overhead. Tool use consumed 2 to 5 times more input tokens, straining the long-context capabilities the models needed for the actual task.
- Task mismatch. The benchmark is built around textual understanding and reasoning. The tools the models reached for were better suited to programmatic operations.
- Tool avoidance. Faced with a choice between file writes and code execution, models picked file writes most of the time, which defeats the point of giving them an execution sandbox.
The vendor pitch for agentic systems is that they handle long, multi-step tasks. DELEGATE-52 is, structurally, a long multi-step task. The benchmark catches the gap between the demo and the loop.
The detail that lingers is the framing the paper picks for the failure mode. Catastrophic corruption, defined as scoring 80% or lower, occurred in more than 80% of model-domain combinations. The errors are not loud. The paper calls them sparse but severe, and stresses that they accumulate quietly. A musician who delegates 20 small edits to a model gets back a score that mostly looks right and has, somewhere in it, a wrong note. An accountant gets back a statement that mostly balances. Because the damage is hard to spot, the 25% corruption rate cited for frontier models is in effect a measure of how much checking a human still has to do.
The first time most CS students see Python passing a benchmark that the other 51 domains fail, the instinct is to read it as a compliment to the language. It is a compliment to its interpreter. Strip the syntax checker out and Python would be in the bottom half of the chart with everything else.