Alibaba launches Qwen3.7-Plus, a multimodal agent model that sees screens and writes code

Qwen3.7-Plus is Alibaba's most explicit agent-first model, designed to perceive screens, ground actions in GUIs, and generate code inside an autonomous loop rather than just answer prompts. It ingests text, images and video to operate software directly — the same screen-understanding-plus-action paradigm now central to Anthropic's Fable 5 and OpenAI's ChatGPT redesign.
The model is available via API on Alibaba Cloud's Bailian platform and tops Alibaba's own GUI-grounding benchmarks. The headline differentiator is price: at $0.40 input and $1.60 output per million tokens, Qwen3.7-Plus undercuts Western frontier agent models by an order of magnitude or more. Fable 5, by comparison, runs $10/$50 per million tokens, which is 25 times the input price and about 31 times the output price.
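To make the price gap concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the per-million-token list prices quoted above; the workload size (2 million input tokens and 200,000 output tokens) is a made-up assumption for illustration, and the `Pricing` class is a hypothetical helper, not part of any vendor SDK. Real bills may differ because of caching discounts, tiered pricing, or how each model tokenizes screenshots and video.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List prices in US dollars per million tokens (figures quoted in the article)."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Return the dollar cost of a workload at these list prices."""
        return (
            input_tokens * self.input_per_million
            + output_tokens * self.output_per_million
        ) / 1_000_000


QWEN_37_PLUS = Pricing(input_per_million=0.40, output_per_million=1.60)
FABLE_5 = Pricing(input_per_million=10.0, output_per_million=50.0)

# Hypothetical agent workload: screen-heavy input, comparatively short output.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 200_000

qwen_cost = QWEN_37_PLUS.cost(INPUT_TOKENS, OUTPUT_TOKENS)
fable_cost = FABLE_5.cost(INPUT_TOKENS, OUTPUT_TOKENS)

# Per-token price ratios, independent of the workload mix.
input_ratio = FABLE_5.input_per_million / QWEN_37_PLUS.input_per_million
output_ratio = FABLE_5.output_per_million / QWEN_37_PLUS.output_per_million

print(f"Qwen3.7-Plus: ${qwen_cost:.2f}")
print(f"Fable 5:      ${fable_cost:.2f}")
print(f"Input price ratio:  {input_ratio:.2f}x")
print(f"Output price ratio: {output_ratio:.2f}x")
print(f"Blended ratio for this workload: {fable_cost / qwen_cost:.1f}x")
```

For this assumed mix, the same job costs about $1.12 on Qwen3.7-Plus and $30.00 on Fable 5, a gap of roughly 27x. The blended ratio always lands between the input ratio (25x) and the output ratio (about 31x), drifting toward the latter as a workload generates more output.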
This fits Alibaba's broader push for 'agent dominance,' including opening Qwen to third-party services and building what it calls a token foundry. The strategy mirrors DeepSeek's playbook: compete on cost and openness to win developer mindshare against pricier US incumbents.
Competitively, an aggressively priced, screen-controlling agent model from a major Chinese cloud provider raises the stakes for the GUI-automation category just as everyone converges on agents. The open question is real-world reliability: GUI-grounding benchmarks don't always translate to robust multi-step automation, and enterprise buyers outside China face data-residency and trust concerns. Readers should watch for independent benchmarks that test Qwen's GUI claims, and for whether the pricing forces Western vendors to respond.