OpenAI just introduced GDPval, a new benchmark that measures whether AI models can match professional work quality across 44 occupations, testing top models like GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4 against industry experts.
The details:
• GDPval comprises 1,320 tasks spanning 9 economic sectors such as healthcare and finance, created by professionals averaging 14 years of experience.
• Claude Opus 4.1 scored highest, with graders rating its output as good as or better than the expert's 47.6% of the time; it excelled at visual presentation, while GPT-5 led in technical accuracy.
• OpenAI also found that performance tripled from GPT-4o to GPT-5 over 15 months, showing rapid improvement in workplace task capabilities.
Why it matters: Despite headlines predicting immediate workforce replacement, GDPval shows that even the best models are only just reaching parity with professionals on certain tasks.