What to know about World Labs Marble and where it stands in the world model race
Marble takes in a single image and creates a full 3D world (but there are some catches)
The spatial-intelligence startup World Labs, led by Stanford AI pioneer Fei-Fei Li, has shared an update on its generative model that creates persistent, navigable 3D worlds from a single image and text prompt.
The company's vision is to build "Large World Models" that can perceive, generate, and interact with the 3D physical world, pushing past today's largely 2D, language-first AI.
Through a new beta preview called Marble, users can create and export these environments, placing the technology in a competitive field alongside major research labs like Google DeepMind. The model generates larger, more stylistically diverse worlds with cleaner geometry than the company's previous results.
How Marble works
World Labs has not disclosed the specific architecture of its model, and the blog post that introduces it offers very few details: you simply provide an image, and Marble creates a virtual world based on it.
However, a significant clue lies in its export format. Users can export the generated worlds as Gaussian splats for use in other projects. This suggests the model uses some form of Gaussian Splatting, a modern technique for rendering photorealistic scenes in real time.
At its core, Gaussian Splatting is a rasterization technique. Instead of building scenes from traditional polygons or triangles, it uses millions of 3D gaussians, each defined by its position, scale, color, and opacity. The process typically begins with a method called Structure from Motion (SfM), which generates a 3D point cloud from a series of 2D images. Each point is then converted into a gaussian. Finally, these gaussians are optimized through a process similar to neural network training, in which they are repeatedly adjusted, split, or pruned until their renders match the original images. The result is a highly detailed representation of a scene that can be rendered quickly.
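To make the idea more concrete, here is a rough Python/NumPy sketch of the data behind a Gaussian splat scene and the depth ordering a renderer would use for alpha blending. This is an illustrative toy, not World Labs' code; all names, shapes, and values are placeholders.

# Conceptual sketch of a Gaussian splat scene (illustrative only).
# Each "splat" is a 3D gaussian with a position, scale, rotation, color and opacity;
# rendering projects the gaussians onto the image plane and alpha-blends them
# front-to-back in depth order.

import numpy as np

num_gaussians = 100_000

scene = {
    "positions": np.random.randn(num_gaussians, 3),                   # 3D centers (x, y, z)
    "scales": np.abs(np.random.randn(num_gaussians, 3)),              # per-axis extent
    "rotations": np.tile([1.0, 0.0, 0.0, 0.0], (num_gaussians, 1)),   # orientation quaternions
    "colors": np.random.rand(num_gaussians, 3),                       # RGB
    "opacities": np.random.rand(num_gaussians, 1),                    # alpha in [0, 1]
}

def render_order(positions: np.ndarray, camera_position: np.ndarray) -> np.ndarray:
    """Return gaussian indices sorted near-to-far from the camera.

    A real renderer rasterizes each gaussian as a 2D ellipse and alpha-blends
    in this order; during training, the parameters above are optimized (and
    gaussians split or pruned) until the renders match the input images.
    """
    depths = np.linalg.norm(positions - camera_position, axis=1)
    return np.argsort(depths)

order = render_order(scene["positions"], camera_position=np.array([0.0, 0.0, -5.0]))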
However, what makes Marble stand out is that it takes a single image and “imagines” the parts that are out of frame. I was granted early access to the model and tested it on a few images. For example, I gave it the following image of a modern office space, and when it created the world, it also rendered tables and conference rooms beyond the original frame (you can see the virtual world here). This is what ties into the “world model” element that World Labs is promising. My guess is that the model creates a latent representation of the objects in the image, expands the surroundings based on the distribution of data it has been trained on, and then generates the entire 3D scene.
Applications and limitations
World Labs' current model is designed to create complete 3D environments rather than isolated objects, and it is still limited in its ability to generate exterior environments. While it is less suitable for generating individual characters or animals, it is ideal for building virtual sets and stages. Early users are already exploring its potential for creating game assets and environments for VR filmmaking, with some reporting that tasks that once took weeks can now be completed in minutes.
However, getting good results from the model is easier said than done. To make the best use of it, you have to know what kind of data it has been trained on. For example, the model was quite good at generating the office from the image above. But when I gave it the illustration of the fantasy tavern below, the generated scene was grainy and buggy (you can see it here), possibly because the illustration style was not aligned with the kind of data the model has been trained on (Marble tends to perform better when given a 3D still image, possibly because it was trained on a lot of 3D renderings). Also, the farther you move away from the original viewpoint, the less detailed the objects become.
Beyond creative applications, this technology has significant implications for training embodied AI agents. By creating realistic and diverse digital twins of the real world, developers can train and validate robotics and self-driving car models in simulation. Nvidia is already pursuing a similar path by using neural reconstruction and Gaussian-based rendering to turn sensor data from real-world drives into high-fidelity simulations for autonomous vehicle development. These simulations can be used in platforms like the CARLA open-source AV simulator to test new scenarios and generate data for rare corner cases.
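As a concrete illustration of that workflow, here is a minimal sketch of collecting camera data from a simulation run with CARLA's Python API. It assumes a CARLA server is running locally on the default port; the blueprint choices and output path are arbitrary, and a Marble-style generated environment would have to be converted into a format the simulator accepts.

# Minimal sketch: spawn a vehicle in CARLA, attach a camera, and save frames,
# e.g. as training data for a perception model or to replay rare scenarios.
# Assumes a CARLA server running on localhost:2000.

import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

blueprints = world.get_blueprint_library()

# Spawn a vehicle and let the built-in autopilot drive it.
vehicle_bp = blueprints.filter("vehicle.*")[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(vehicle_bp, spawn_point)
vehicle.set_autopilot(True)

# Attach an RGB camera to the vehicle and write each frame to disk.
camera_bp = blueprints.find("sensor.camera.rgb")
camera_transform = carla.Transform(carla.Location(x=1.5, z=2.4))
camera = world.spawn_actor(camera_bp, camera_transform, attach_to=vehicle)
camera.listen(lambda image: image.save_to_disk(f"out/{image.frame:06d}.png"))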
What is a world model anyway?
The approach taken by World Labs contrasts with that of competitors like Google DeepMind. World Labs provides a tool that generates an explicit, exportable 3D asset (the Gaussian splat file), which can then be imported into other applications such as game engines or simulators.
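To show what such an exportable asset looks like in practice, here is a short sketch that inspects a splat file with the plyfile library. The property names follow the PLY convention popularized by the original 3D Gaussian Splatting implementation; Marble's actual export layout may differ, and the file name is hypothetical.

# Sketch: inspecting an exported Gaussian splat file with plyfile.
# Property names (x/y/z, opacity, scale_*, rot_*) follow the common 3DGS
# convention; a given exporter may use a different layout.

import numpy as np
from plyfile import PlyData

ply = PlyData.read("marble_scene.ply")  # hypothetical exported file name
vertices = ply["vertex"]

positions = np.stack([vertices["x"], vertices["y"], vertices["z"]], axis=-1)
opacities = np.asarray(vertices["opacity"])

print(f"{len(positions)} gaussians, scene bounds:")
print(positions.min(axis=0), positions.max(axis=0))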
DeepMind's Genie 3, on the other hand, is an end-to-end generative world model. It uses an auto-regressive architecture to generate and simulate an interactive environment frame by frame in real time, based on text prompts and user actions. In this model, environmental consistency is an emergent property, not the result of a pre-existing 3D structure. The entire world and its interactions exist within a single, dynamic model, without producing a static 3D asset that can be exported. (Genie 3 is currently not available for public use.)
This distinction highlights a broader conversation in the AI community about what a "world model" is. The term is currently used to describe two different concepts. The first, represented by systems like World Labs' Marble and DeepMind's Genie 3, refers to a generative model that can create and simulate an external environment. These models are designed to generate the settings where AI agents can be trained or where users can have interactive experiences.
The second concept of a world model is an internal, predictive system that an AI agent uses to interpret the world around it. This is closer to how humans and animals operate; we don't predict the future at the pixel level but instead rely on abstract representations to anticipate likely outcomes. Models like Meta's Joint Embedding Predictive Architecture (JEPA) are designed for this purpose. They learn the latent features that govern interactions in the world, allowing an agent to make efficient predictions and take actions without needing a complete, photorealistic simulation. My bet is that the future of embodied AI will likely depend on a combination of both approaches: generative models like Marble will create vast and complex virtual worlds to train agents that are equipped with efficient, predictive world models like V-JEPA.
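To show the difference in spirit, here is a toy PyTorch sketch of latent-space prediction: the model is trained to predict the embedding of the next observation rather than its pixels. This is a simplified illustration of the JEPA idea, not Meta's architecture (real JEPA variants use, among other things, a separate target encoder updated by exponential moving average).

# Toy sketch of latent-space prediction, the core idea behind JEPA-style
# world models: predict the embedding of the next observation, not its pixels.
# Dimensions and architecture are placeholders, not Meta's design.

import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    def __init__(self, obs_dim=1024, latent_dim=128, action_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.predictor = nn.Sequential(nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
                                       nn.Linear(256, latent_dim))

    def forward(self, obs, action, next_obs):
        z = self.encoder(obs)
        z_next_pred = self.predictor(torch.cat([z, action], dim=-1))
        # Target embedding is held fixed here; real JEPA variants use a
        # separate EMA target encoder instead of a no_grad pass.
        with torch.no_grad():
            z_next_target = self.encoder(next_obs)
        return nn.functional.mse_loss(z_next_pred, z_next_target)

model = LatentPredictor()
loss = model(torch.randn(4, 1024), torch.randn(4, 8), torch.randn(4, 1024))
loss.backward()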