Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124

It’s been two years since Microsoft CEO Satya Nadella made his prediction that AI would transform knowledge work – the work done by lawyers, accountants, IT professionals and others.
But despite the great progress made by foundation models, that transformation has been slow to arrive. Models have grown adept at deep research and agentic workflows, but for whatever reason, many white-collar jobs have been largely unaffected.
It’s one of the biggest mysteries in AI – and thanks to new research from data company Mercor, we’re getting some answers.
The new study tests how well advanced AI models can handle real-world jobs in fields ranging from consulting to investment banking to law. The result is a new benchmark called Apex-Agents – and so far, every AI lab is getting a failing grade. When faced with questions drawn from real professionals, even the best models struggled to answer more than a quarter of them correctly. Most of the time, the models came back with the wrong answer or no answer at all.
According to Brendan Foody, who worked on the paper, the models’ main stumbling block was tracking information across multiple sources – something that is critical to most knowledge work.
“One of the biggest differences with this benchmark is that we created a whole environment, based on the nature of real-world work,” Foody told TechCrunch. “The way work actually gets done, there isn’t one person who hands you everything in one place. In real life, you work across Slack and Google Drive and all these other tools.” For most AI models, that kind of cross-tool reasoning is hit or miss.

The scenarios were drawn from real experts in Mercor’s specialist marketplace, who also answered the questions themselves and set the standard for a good response. Looking through the questions, which are posted publicly on Hugging Face, gives a sense of how difficult the tasks can be.
One question in the “Law” section reads:
Within the first 48 minutes of the end of EU production, Northstar’s engineering team sent one or two EU products containing proprietary information to a US supplier…
The correct answer is yes, but getting there requires a thorough examination of the company’s internal policies and EU privacy law.
That would stump even an experienced professional, but the researchers were deliberately trying to mimic the work professionals actually do. If an LLM could reliably answer these questions, it could take over much of the work lawyers do today. “I think this is the most important topic in the economy,” Foody told TechCrunch. “This benchmark is a great reflection of the real work these people do.”
OpenAI has also tried to measure this kind of capability with its GDPval benchmark – but the Apex-Agents tests differ in important ways. Where GDPval measures general knowledge across various professions, Apex-Agents measures a system’s ability to perform standardized tasks in high-value services. The result is much harder for the models, and more directly relevant to whether these tasks can be automated.
Although none of the models are ready to take jobs at investment banks, some came closer than others. Gemini 3 Flash was the best of the bunch with 24% single-shot accuracy, closely followed by GPT-5.2 with 23%. Below that, Opus 4.5, Gemini 3 Pro and GPT-5 all scored around 18%.
Although the initial results are low, the AI field has a history of conquering tough benchmarks. Now that the Apex test is public, it stands as a challenge to any AI lab that believes it can do better – something Foody fully expects to see in the coming months.
“It’s moving fast,” he told TechCrunch. “Right now, it’s not fair to say it’s like a person who has been doing the job for a long time – but a year ago it was five or ten times worse.”