From foundation models to foundation agents
What real-world agents can learn from the success of LLMs and VLMs.
A new position paper by researchers at the University of Chinese Academy of Sciences makes the case for "foundation agents," defined as "generally capable agents across physical and virtual worlds," which the authors describe as "the paradigm shift for decision making, akin to LLMs as general-purpose language models to solve linguistic and knowledge-based tasks."
Foundation agents are meant to address the shortcomings of RL- and IL-based control systems, which must be trained for very narrow tasks and require large amounts of human demonstrations or manually labeled data.
The basic premise of foundation agents is that by pre-training models through unsupervised learning on large amounts of unlabeled data from the real world, we obtain a system that encodes broad world knowledge and can then be focused on specific applications through supervised fine-tuning or prompt engineering. This recipe has proven very successful in LLMs and VLMs.
However, unlike LLMs/VLMs, which are mostly designed for generative tasks, foundation agents must be able to leverage their knowledge to take actions and accomplish tasks, which presents a unique set of challenges.
Foundation agents build on the strengths of LLMs/VLMs to create versatile AI systems that can solve multiple tasks in open-ended environments, as opposed to classic RL systems that must be trained separately for every single task.
Foundation agents are built on three principles (sketched in code after this list):
1- The model(s) are trained on unlabeled interactive data collected from the real world or simulated environments
2- The agent is customized for downstream tasks through supervised fine-tuning and/or custom prompts/commands
3- The agent is aligned with human goals/values through its internalized knowledge and instructions from users
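To make these principles concrete, here is a minimal Python sketch of the three-stage pipeline. The FoundationAgent class and its pretrain/customize/act methods are hypothetical illustrations of the stages, not an implementation or API from the paper.

```python
# Hypothetical sketch of the three-stage foundation-agent pipeline.
# Class and method names are illustrative, not an API from the paper.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FoundationAgent:
    """Toy stand-in for a large sequence model over observations and actions."""
    knowledge: list = field(default_factory=list)
    task: Optional[str] = None

    def pretrain(self, interaction_data: list) -> None:
        # 1) Self-supervised pre-training on unlabeled interaction data
        #    (videos, logs, simulator rollouts) with no reward labels.
        self.knowledge.extend(interaction_data)

    def customize(self, task: str) -> None:
        # 2) Adaptation to a downstream task, via fine-tuning on a small
        #    labeled set or simply by conditioning on a prompt/command.
        self.task = task

    def act(self, instruction: str) -> str:
        # 3) Alignment: follow the user's instruction, grounded in the
        #    internalized knowledge (reduced here to a trivial lookup).
        assert self.task, "customize the agent before deployment"
        relevant = [k for k in self.knowledge if instruction in k]
        return f"[{self.task}] acting on '{instruction}' using {len(relevant)} memories"


agent = FoundationAgent()
agent.pretrain(["robot picks up cup", "robot opens door", "door closes"])
agent.customize("household-manipulation")
print(agent.act("door"))  # [household-manipulation] acting on 'door' using 2 memories
```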
Foundation agents can benefit immensely from existing LLMs/VLMs or be built from the ground up with a unified data representation (multi-modal models that jointly encode different types of data, including actions).
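As an illustration of what such a unified data representation could look like, the toy sketch below interleaves text, image, and action codes into a single token stream, in the spirit of unified multimodal agents such as DeepMind's Gato; the modality tags and the tokenize() helper are assumptions made for this example, not part of the paper.

```python
# Hypothetical sketch of a unified token stream that interleaves
# observations and actions so one sequence model can consume them all.
TEXT, IMAGE, ACTION = range(3)  # illustrative modality tags

def tokenize(modality: int, payload: list) -> list:
    """Map raw integer codes from any modality into (modality, code) pairs."""
    return [(modality, code) for code in payload]

# One trajectory step: a text instruction, image patches, then the action.
sequence = (
    tokenize(TEXT, [101, 7592])       # e.g., word-piece ids for "pick cup"
    + tokenize(IMAGE, [12, 840, 33])  # e.g., discretized image-patch codes
    + tokenize(ACTION, [5, 2])        # e.g., quantized joint commands
)
print(sequence[:4])
```

A single sequence model trained on such interleaved streams can then predict action tokens the same way an LLM predicts the next word.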
There are still challenges to solve, but we're already seeing promising directions in robotics, where language, vision, and action models are being combined to create robotic systems that can generalize to tasks, environments, and morphologies beyond their training data.
Read more about foundation agents on VentureBeat.
Read the paper on arXiv.