A generative world model for humanoid robots
1X trained its generative model on thousands of hours of video and sensory data collected from its EVE robots.
The classic way to develop robotics control models is to first train them in a simulated environment such as Isaac or Bullet and then deploy them on real robots. The problem with this approach is the mismatch between the simulation and the real world, referred to as the “sim2real gap.” These discrepancies range from visual (rendered CGI vs. real camera images) to physical (simulated dynamics that don’t match real-world physics), and they degrade the model’s performance when it is deployed on the real robot.
To address this problem, robotics startup 1X has developed a new generative model that replaces simulation engines. The model has been trained on thousands of hours of video and sensory data collected from the company’s EVE robots. It has learned a “world model” that can predict the next frame of a video as well as robot action sequences.
The work is inspired by video generation models such as Sora. However, this is an interactive model that responds to actions, which makes it suitable for training robotics models. Its image quality is photorealistic, which narrows the sim2real gap considerably compared to classic simulation engines. It also shows impressive accuracy in simulating rigid and deformable objects, and it can predict long-horizon tasks such as folding shirts.
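The key property described above is that the model is interactive: given the current observation and a robot action, it predicts the next observation, and predictions can be fed back in autoregressively to roll out a whole trajectory. The toy sketch below illustrates that interface only; the class, its linear “dynamics,” and the `action_gain` parameter are illustrative assumptions, not 1X’s actual architecture, which is a learned video model.

```python
# Toy sketch of an action-conditioned world-model interface (an assumption
# for illustration, not 1X's real model): predict the next observation from
# the current observation and a robot action, then roll out autoregressively.
from dataclasses import dataclass
from typing import List

@dataclass
class ToyWorldModel:
    # Hypothetical "learned" parameter: how strongly an action shifts the scene.
    action_gain: float = 0.1

    def step(self, observation: List[float], action: List[float]) -> List[float]:
        # Predict the next "frame" as the current one nudged by the action.
        return [o + self.action_gain * a for o, a in zip(observation, action)]

    def rollout(self, observation: List[float],
                actions: List[List[float]]) -> List[List[float]]:
        # Autoregressive rollout: each prediction becomes the next input.
        # This closed loop is what lets a control policy train "inside" the model.
        frames = [observation]
        for a in actions:
            frames.append(self.step(frames[-1], a))
        return frames

model = ToyWorldModel()
frames = model.rollout([0.0, 0.0], [[1.0, 0.0], [1.0, 0.0]])
print(frames[-1])  # final predicted observation after two actions
```

A real learned world model replaces the linear `step` with a neural network trained on robot video, but the rollout loop, and the reason it can stand in for a physics simulator, is the same.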
The model still faces challenges. For example, the team has to make sure the world model stays up to date with the changing world, which means continuing to fine-tune it on fresh video collected from their robots. But they believe that since the model has already learned the core representations of the world, updating it with new data will not be very difficult.
A more fundamental problem is that the model sometimes hallucinates. For example, it can miss basic physics, such as failing to predict that a dish will fall when it is released in midair.
For the moment, the team’s plan is to continue scaling the data and compute used to train the models to improve their accuracy. The company will also run challenges and competitions to get help from the community, and it has open-sourced two models based on Llama and GENIE.
Read more about the world model along with remarks from the VP of AI at 1X on VentureBeat
Read more about the model on the 1X blog
See the models and weights on GitHub and Hugging Face