The Agent Age: The AI That Learned to Use a Mouse

Picture someone at a desk, moving a mouse across a screen. They open a spreadsheet, scroll to row 47, copy a product ID, switch to a web browser, paste it into a search field, click "Look up," read the result, and type a summary into a chat window. Nothing about this requires genius. It requires eyes, hands, and the patience to do it five hundred times.

Now picture the same sequence with no one sitting at the desk. The cursor moves on its own. The spreadsheet opens. Row 47 is found. The product ID is copied, pasted, searched. The result is read. The summary is typed. Every click, every scroll, every tab switch happens exactly as a person would do it, except the "person" is a language model that can see the screen.

That's not a hypothetical. On February 16, 2026, Alibaba released Qwen 3.5, and the most interesting thing about it isn't that it's fast, or cheap, or multilingual. It's that it can look at any application on any operating system and use it the way you do.

What "visual control" actually means

Most AI tools today work through APIs: structured connections that let software talk to other software behind the scenes. If you want an AI to file a support ticket, someone has to build a connector between the AI and the ticketing system. No connector, no automation. This is why most enterprise "AI transformation" stalls at the integration stage: the AI is smart enough, but it can't reach the tools.

Qwen 3.5 skips the connector entirely. The model was trained from its earliest stage on text, images, and video together. Alibaba calls this early-fusion multimodal training, meaning visual understanding wasn't added to a text model after the fact; it was built into the foundation from day one. The result is a model that doesn't just describe what's on a screen. It understands interface elements well enough to interact with them.
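At its core, a visual agent of this kind reduces to a loop: capture the screen, ask the model for one interface action, perform it, repeat. Nothing below is Qwen 3.5's actual API; the function names (`propose`, `execute`) and the stubbed "model" are hypothetical stand-ins, a minimal sketch of the observe-decide-act pattern:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def run_agent(goal: str,
              screenshot: Callable[[], bytes],
              propose: Callable[[str, bytes], Action],
              execute: Callable[[Action], None],
              max_steps: int = 20) -> List[Action]:
    """Observe-decide-act loop: look at the screen, ask the model for
    one UI action, perform it, and repeat until the model says 'done'."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = propose(goal, screenshot())   # model sees pixels, picks a step
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                        # OS-level click/keystroke
    return history

# Stubbed demo: a fake "model" that clicks once, then declares success.
script = iter([Action("click", x=120, y=300), Action("done")])
trace = run_agent("copy product ID from row 47",
                  screenshot=lambda: b"<pixels>",
                  propose=lambda goal, img: next(script),
                  execute=lambda a: None)
print([a.kind for a in trace])  # ['click', 'done']
```

The point of the sketch is that everything application-specific lives inside `propose`: swap in a model that genuinely understands screenshots, and the same loop drives a spreadsheet, a browser, or a 1990s internal tool.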

The benchmarks back this up. OSWorld is a desktop automation test that drops a model into a real operating system, primarily Ubuntu Linux, and asks it to complete tasks like "find the cheapest flight on this travel site" or "rename all files in this folder that match a pattern." It then checks whether the model actually succeeded. Qwen 3.5 scores 62.2 on OSWorld, 66.8 on AndroidWorld (a similar test for mobile apps), and 69.0 on BrowseComp, a web navigation benchmark where its score rises to 78.6 when given the same multi-step planning tools that top models from OpenAI and Google use. These aren't toy demos.

The model itself is architecturally unusual. It contains 397 billion learned settings (parameters) in total but activates only 17 billion for each chunk of text it processes, using a sparse mixture-of-experts design. Think of it as a building with 397 billion rooms where the lights are on in only 17 billion of them at any moment; the model routes each task to the specific "rooms" most relevant to it, which keeps compute costs down. Alibaba says this makes Qwen 3.5 60% cheaper to run than its predecessor and 8.6 to 19 times faster at generating output. It's released with an open license, meaning anyone can download, modify, and run it for free.
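Alibaba hasn't published the router internals, but the "only some lights are on" idea is standard top-k gating: score every expert, keep the best few, renormalize their weights. The expert count, k, and scores below are made up for illustration:

```python
import math
import random

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_scores, k=2):
    """Pick the top-k experts for one token and renormalize their weights.
    All other experts stay inactive for this token -- which is why total
    parameter count and per-token compute can diverge so sharply."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_scores[i] for i in chosen])
    return list(zip(chosen, weights))

# Toy run: 8 experts, only 2 active per token.
random.seed(0)
scores = [random.gauss(0, 1) for _ in range(8)]
active = route_token(scores, k=2)
print(active)  # two (expert_index, weight) pairs; weights sum to 1
```

Scale the toy numbers up and the economics follow: a token only pays for the experts it is routed to, not for the full parameter count.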

Why the timing matters more than the model

A single model that can click buttons is interesting. What makes this moment worth paying attention to is everything happening around it.

On February 5, OpenAI launched Frontier, an enterprise platform for building, deploying, and managing AI agents across company workflows. HP, Intuit, Oracle, and Uber are already using it. Three days after Qwen 3.5, Microsoft's Agent Framework hit release candidate, unifying their Semantic Kernel and AutoGen projects into a single open-source system for orchestrating agents in Python and .NET. And on February 17, one day after Qwen 3.5's release, NIST announced the AI Agent Standards Initiative, a federal effort to develop security, identity, and interoperability standards for autonomous AI agents.

In a single two-week window: the deployment platform, the development framework, the regulatory scaffolding, and now the visual control capability. Each piece solves a different problem. OpenAI's Frontier gives agents a managed environment to run in. Microsoft's framework gives developers tools to build them. NIST's initiative addresses the trust gap that makes enterprises hesitant. And Qwen 3.5 removes the last major integration bottleneck by letting agents interact with software through the same interface humans use.

The old question was "Can AI reason well enough to do knowledge work?" That one has been settled for a while. The new question is "How do we connect AI to all the messy, fragmented software that companies actually use?" The API-by-API approach works but scales slowly; every new application needs a new integration. Visual control offers a different path. If the agent can read a screen and click a button, it can use any software a person can use, from a 1990s internal tool with no API to a brand-new cloud application on launch day.

What this means for the next twelve months

The practical implication is simpler than it sounds. Companies spend enormous amounts of time and money on what consultants politely call "process automation" and everyone else calls "copying data between systems." A procurement team copies invoice numbers from email into their company's accounting software. A compliance analyst opens ten browser tabs, reads each one, and fills out a checklist in a spreadsheet. A recruiter takes notes from a video call, then manually enters them into an applicant tracking system. None of this work is hard. All of it is expensive.

If visual agents get reliable enough (and "reliable enough" is doing a lot of work in that sentence), these tasks become automatable without anyone building a custom integration. The agent just watches the screen and does what the person would have done. The bottleneck shifts from "do we have an API for that?" to "do we trust the agent to click the right button?" That's still a real bottleneck. Qwen 3.5's OSWorld score of 62.2 means it fails roughly four out of ten desktop tasks. But the trajectory is steep: a year ago, the best models scored in the mid-20s on this same test.

Here is what I think happens next. Visual agent capability becomes a baseline feature for the leading AI models by the end of 2026. OpenAI, Anthropic, and Google are all working on similar capabilities, and Alibaba just made its version free for anyone to use. Companies will start piloting visual agents on their most tedious, lowest-risk workflows: data entry, form filling, report compilation. The ones that work will expand. The ones that fail will fail visibly, because an agent clicking the wrong button in a live system is the kind of mistake that gets noticed immediately.

The age of the agent isn't arriving because AI got smarter. It's arriving because AI learned to use a mouse. In the next Agent Age installment, we look at what happens when the builder who made the most-starred personal agent on GitHub joins OpenAI — and what his hiring reveals about where the agent market is actually heading.