How far can you trust Claude Computer Use?

Good for fast prototyping, not very reliable for repeatable outcomes.

Dec 08, 2024

A recent study by Show Lab at the National University of Singapore gives a glimpse of the capabilities of the new generation of graphical user interface (GUI) agents. The study comes on the heels of Anthropic's release of Computer Use, a feature that allows its Claude model to see the content of the computer’s display and use the mouse and keyboard to control the operating system.

The feature promises to enable users to automate tasks through simple instructions and without the need to have API access to applications.

The researchers tested Claude on a variety of tasks including web search (use a browser to find information and accomplish tasks such as purchasing items), workflow completion (multi-app interactions such as extracting information from the web and inserting it into a file), office productivity (format documents, modify content, etc.), and video games.

The framework the researchers propose has three elements. First, a plan element determines the steps needed to accomplish the task. Then an act element determines the actions needed to carry out the plan. Finally, a critic element checks whether the model can evaluate its own progress and success in accomplishing the task, and whether it can recognize when the task is impossible.
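To make the three elements concrete, here is a minimal Python sketch of a plan, act, critic loop. It is an illustration only, not the researchers' code or Anthropic's API: the names `run_agent`, `toy_plan`, `toy_act`, and `toy_critic` are hypothetical, and the "model" is replaced by stub functions so the control flow can run on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepResult:
    step: str      # the planned step
    action: str    # the action chosen for that step
    ok: bool       # the critic's verdict on the action


@dataclass
class AgentRun:
    task: str
    results: List[StepResult] = field(default_factory=list)
    status: str = "in_progress"  # ends as "success", "failed", or "impossible"


def run_agent(
    task: str,
    plan: Callable[[str], List[str]],
    act: Callable[[str, int], str],
    critic: Callable[[str, str], bool],
    max_retries: int = 1,
) -> AgentRun:
    """Plan the task, act on each step, and let a critic judge every action."""
    run = AgentRun(task=task)
    steps = plan(task)
    if not steps:
        # No plan could be produced: report the task as impossible.
        run.status = "impossible"
        return run
    for step in steps:
        for attempt in range(max_retries + 1):
            action = act(step, attempt)
            ok = critic(step, action)
            run.results.append(StepResult(step, action, ok))
            if ok:
                break
        else:
            # Every attempt at this step was rejected by the critic.
            run.status = "failed"
            return run
    run.status = "success"
    return run


# Hypothetical stubs standing in for model calls.
def toy_plan(task: str) -> List[str]:
    return ["open browser", "search for item", "click buy button"]


def toy_act(step: str, attempt: int) -> str:
    # The first attempt at the last step skips scrolling; the retry scrolls first.
    if step == "click buy button":
        if attempt == 0:
            return "click(buy) without scrolling"
        return "scroll down, then click(buy)"
    return f"do: {step}"


def toy_critic(step: str, action: str) -> bool:
    return "without scrolling" not in action


if __name__ == "__main__":
    run = run_agent("buy an item", toy_plan, toy_act, toy_critic)
    print(run.status, len(run.results))
```

In a real agent, each of the three callables would be backed by a model call and a screenshot of the current screen. The design choice worth noting is that the critic sits inside the retry loop, so a rejected action gets another attempt before the run is marked as failed.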

The researchers ran the tests on Claude Computer Use and manually reviewed the results and reasoning traces. Their findings show that, in general, Claude does a great job of accomplishing complex tasks. It can reason about the task, plan its actions, and carry them out properly. It was also able to perform tasks that required switching between different applications, and it did a decent job of reviewing its actions and making sure it was making progress toward the goal.

However, like other AI systems, it sometimes makes trivial mistakes that human users can easily avoid. For example, if a button needed to accomplish the task is below the fold, the agent might not scroll down and end up saying that the task is not possible.

The findings show that while GUI agents are promising, they also have their limits. Their inner workings remain a mystery and can result in unexpected outcomes. This can also make them vulnerable to adversarial attacks.

Nonetheless, tools like Claude Computer Use can be very useful for fast prototyping, especially when testing proofs of concept that involve several applications that don’t have ready-made APIs. But when it comes to scale and reliability, we still need good engineering.

Read the full study on arXiv

TechTalks
