Alibaba's Qwen3.7-Max cracks top five of Code Arena WebDev leaderboard, passing models from OpenAI and Google
Alibaba's Qwen3.7-Max reached fourth place on Code Arena's WebDev leaderboard, which measures a model's ability to build web applications from user prompts. It surpassed deployed models from OpenAI and Google, making Alibaba the only non-US developer in the top five. The Qwen team separately touted a #3 finish on ITbench-AA, a new benchmark that tests enterprise IT tasks in an agentic setting.
The model is explicitly engineered for agent-driven workflows: coding, office automation and extended task execution. Alibaba's most striking claim is autonomous operation for up to 35 hours without performance degradation — a direct pitch at the long-horizon reliability that defines this week's agentic-coding theme. 'Agentic era, go with Qwen,' the team posted.
The broader significance is competitive geography: a Chinese lab placing in the top five of a Western coding leaderboard reinforces that the frontier-coding gap has narrowed, echoing DeepSeek's V4 family and its aggressive price cuts. The skeptical read, voiced by Ethan Mollick this week, is that benchmark placement may overstate real-world robustness: he argues open and non-US models remain 'much more fragile, especially out-of-distribution' than benchmarks indicate. Buyers should wait for independent verification of the 35-hour autonomy claim before relying on it in production workflows.