Why you can't trust a hallucinated world for real-world training
Beyond creating a new era of interactive games, can these expensive, hallucination-prone models ever be trusted to train reliable robots for the physical world?
Google DeepMind’s Genie 3 generates interactive, playable worlds from a single text prompt. The model renders these dynamic environments in real time at 720p resolution and 24 frames per second, a significant step beyond passive video generation toward controllable simulation. Its auto-regressive architecture builds each new frame by considering the previous frames and the user’s last action, allowing environmental consistency to emerge without an explicit 3D model.
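To make the mechanism concrete, here is a minimal sketch of such an auto-regressive loop. The `WorldModel` class and its `predict_frame` method are hypothetical stand-ins; Genie 3's actual architecture and API have not been published.

```python
# Toy sketch of an auto-regressive world-model loop: each frame is
# predicted from a window of previous frames plus the user's latest action.
# `WorldModel` and `predict_frame` are hypothetical stand-ins; Genie 3's
# real architecture and interface are not public.
from collections import deque
from dataclasses import dataclass

import numpy as np


@dataclass
class Action:
    """A user input, e.g. a movement key or a text event."""
    name: str


class WorldModel:
    """Placeholder for a learned frame predictor."""

    def predict_frame(self, history: list[np.ndarray], action: Action) -> np.ndarray:
        # A real model would run a neural network here; we return noise
        # shaped like a 720p RGB frame to keep the sketch self-contained.
        return np.random.rand(720, 1280, 3).astype(np.float32)


def rollout(model: WorldModel, first_frame: np.ndarray, actions: list[Action],
            context_len: int = 16) -> list[np.ndarray]:
    """Generate a playable sequence: the only 'world state' is frame history."""
    history: deque[np.ndarray] = deque([first_frame], maxlen=context_len)
    frames = [first_frame]
    for action in actions:
        frame = model.predict_frame(list(history), action)
        history.append(frame)   # the new frame becomes part of the context
        frames.append(frame)    # no explicit 3D scene is ever constructed
    return frames
```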
While this technology opens the door to infinite, explorable worlds, it also highlights the fundamental challenges facing purely generative, end-to-end world models.
The ambition behind Genie 3 is to build what DeepMind CEO Demis Hassabis calls a "world model," a system that understands the physical world. In an interview with Google’s Logan Kilpatrick, he explained that if an AI is to become truly general, "it clearly needs to understand the physical world."
A world model must grasp not just static objects but also physics, the properties of materials like liquids, and the behaviors of living things. One of the best ways to test the depth of such a model, according to Hassabis, is to have it generate a consistent and believable world from scratch.
This capability promises two major applications. The first is as a research tool for training other AIs. DeepMind is already using Genie 3 to generate environments for its SIMA agent, creating a scenario where, as Hassabis describes, "you've got basically one AI playing in the mind of another AI." This could provide unlimited training data for robotics. The second application lies in creating new forms of entertainment. Hassabis envisions "next-generation, incredible games" and a new genre that sits somewhere between a film and a game.
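The "one AI playing in the mind of another AI" setup can be pictured as a simple interaction loop. The sketch below is illustrative only: `DreamedWorld`, `Agent`, and their methods are invented placeholders, not the real SIMA or Genie 3 interfaces.

```python
# Hedged sketch of "one AI playing in the mind of another AI": a generative
# world model serves as the training environment for a separate agent.
# `DreamedWorld` and `Agent` are invented placeholders, not the actual
# SIMA / Genie 3 interface, which is not public.


class DreamedWorld:
    """Stands in for a world model (Genie 3's role) used as an environment."""

    def __init__(self, prompt: str):
        self.frame = f"initial frame for: {prompt}"

    def step(self, action: str) -> str:
        self.frame = f"frame after {action!r}"  # a real model renders pixels
        return self.frame


class Agent:
    """Stands in for an embodied agent (SIMA's role)."""

    def act(self, observation: str) -> str:
        return "move_forward"  # a trained policy would decide here

    def learn(self, observation: str, action: str) -> None:
        pass  # policy update from the experience tuple


world, agent = DreamedWorld("a rainy mountain village"), Agent()
obs = world.frame
for _ in range(100):                 # unlimited synthetic experience:
    action = agent.act(obs)          # one AI acting...
    obs = world.step(action)         # ...inside the imagination of another
    agent.learn(obs, action)
```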
Despite these advances, generative models exhibit what Hassabis calls "jagged intelligence": models of this class can achieve superhuman feats, such as earning a gold medal at the International Mathematical Olympiad, yet fail at seemingly simple tasks. Hassabis notes they "can still make simple mistakes in high school maths, simple logic problems, or simple games if they're posed in a certain way." This uneven performance reveals deep-seated limitations that currently prevent their use in many applications.
One of the most significant issues is the "reliability paradox." Genie 3 can sometimes suffer from physics inaccuracies and visual hallucinations, such as people appearing to walk backward. If these simulated worlds are not physically reliable, they cannot be trusted to train an agent for correct performance in the real world. The problem is foundational: in a generative model, the world and the perception of it are one and the same. Unlike a human who perceives an external reality and can correct their internal model, a generative world has no ground truth to check against. When it hallucinates, it builds upon that error, creating a cascade of inconsistencies.
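A toy example makes the cascade concrete. Assume a learned model whose dynamics are almost, but not exactly, right; rolled out on its own outputs, with nothing external to correct against, its world drifts arbitrarily far from reality. The numbers below are illustrative, not measurements of any real model.

```python
# Toy illustration of compounding error in a closed-loop generative model.
# A predictor with a tiny systematic bias is rolled out on its own outputs;
# with no external ground truth to correct against, the drift accumulates.
# All numbers are illustrative, not measurements of any real system.


def true_dynamics(x: float) -> float:
    return 0.99 * x  # the "real world": a slowly decaying quantity


def learned_model(x: float) -> float:
    return 0.99 * x + 0.01  # almost right, but with a small per-step bias


x_real, x_dreamed = 1.0, 1.0
for step in range(1, 501):
    x_real = true_dynamics(x_real)        # grounded: follows reality
    x_dreamed = learned_model(x_dreamed)  # hallucinated: feeds on itself
    if step in (1, 10, 100, 500):
        print(f"step {step:3d}: real={x_real:.3f}  dreamed={x_dreamed:.3f}  "
              f"error={abs(x_dreamed - x_real):.3f}")
```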
The model also faces a scalability wall related to memory. Genie 3 maintains consistency for a few minutes, a major improvement over its predecessor but far from the persistence needed for complex simulations. To keep a world coherent, the model must remember everything that has happened, including events that occur off-camera and out of view. The computational requirements to extend this memory to hours or days are immense, and it is an open question whether the current architecture can overcome this hurdle.
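Some back-of-envelope arithmetic shows the shape of the problem. Assuming a transformer-style model that attends over its full frame history, and a purely illustrative token budget per frame, context and compute grow rapidly with simulated time:

```python
# Back-of-envelope arithmetic for the memory wall. The tokens-per-frame
# figure is an assumed value for illustration; Genie 3's real token budget
# is not public. Self-attention over the full history costs O(n^2) in
# context length, which is what makes hours of coherent memory so expensive.

FPS = 24                  # Genie 3's stated frame rate
TOKENS_PER_FRAME = 1_000  # assumption: rough latent-token budget per frame

for label, seconds in [("1 minute", 60), ("10 minutes", 600), ("1 hour", 3600)]:
    frames = FPS * seconds
    tokens = frames * TOKENS_PER_FRAME
    # relative attention cost, normalized to the 1-minute case
    rel_attn = (tokens / (FPS * 60 * TOKENS_PER_FRAME)) ** 2
    print(f"{label:>10}: {frames:>7,} frames, {tokens:>12,} context tokens, "
          f"~{rel_attn:,.0f}x the attention cost of one minute")
```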
These challenges suggest that a modular approach may be more practical for now. This means using different tools for specific jobs: a generative model to create the world's structure in an established format like OpenUSD, a dedicated physics engine like Isaac Lab for simulation, and a game engine to track object states (optionally, another deep learning model, such as a variational autoencoder, can add a layer of photorealism to the game engine's output). A separate agentic framework can control the behavior of the AI characters that share the virtual world with the main character. Finally, the agent navigating the world does not need to make pixel-level predictions of its environment and can use much more efficient architectures such as Meta's V-JEPA.
This approach is far more structured, predictable, and controllable than packing everything into a single model, where one error can propagate through the whole system.
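To make the division of labor concrete, here is a hedged sketch of how such a pipeline could compose. OpenUSD, Isaac Lab, and V-JEPA are the real projects named above, but every class and method below is an invented placeholder, not their actual APIs.

```python
# Sketch of the modular alternative: each stage owns one job and exposes
# explicit, checkable state. All class and method names here are invented
# placeholders; OpenUSD, Isaac Lab, and V-JEPA are real projects, but these
# are not their actual APIs.


class LayoutGenerator:
    """Generative model that emits world *structure* (e.g. an OpenUSD scene)."""

    def generate(self, prompt: str) -> str:
        return f"/tmp/{prompt.replace(' ', '_')}.usd"  # path to a scene file


class PhysicsSim:
    """Dedicated physics engine (Isaac Lab fills this role in the article)."""

    def __init__(self, scene_path: str):
        self.scene_path = scene_path
        self.state: dict[str, object] = {}  # explicit ground-truth state

    def step(self, action: str) -> dict[str, object]:
        self.state["last_action"] = action  # physics is computed, not dreamed
        return self.state


class Renderer:
    """Game engine render pass; a VAE could optionally add photorealism."""

    def render(self, state: dict[str, object]) -> bytes:
        return repr(state).encode()  # placeholder for a rendered frame


def modular_step(sim: PhysicsSim, renderer: Renderer, action: str) -> bytes:
    # Errors cannot silently cascade: the simulator's state is the ground
    # truth that every downstream stage (renderer, agent) can check against.
    state = sim.step(action)
    return renderer.render(state)
```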
However, the jury is still out on which philosophy will prevail. The history of AI has shown that scaled, end-to-end models like LLMs can eventually outperform complex, engineered pipelines. Hassabis himself points to a future of convergence with what he calls an "Omni-model," a single system capable of handling video, language, and world simulation. For now, Genie 3 is a flawed but powerful demonstration of what is possible. While modular systems appear more reliable today, the race to build the ultimate simulator is far from over.



🧠 Ben, Quick Thoughts on Your "healthy, with skepticism" Hallucinated World Piece 🙏👍👏, which I enjoyed...
1. Hallucinations ≠ Randomness
You’re right—models like Genie 3 hallucinate. But DeepMind sees generation as a test of understanding. If a model can keep objects consistent across frames, that’s not noise—it’s structure. I take these as "evolutionary phases" - things will improve over time. I take more comfort from Demis' views (vs Sam - sorry to say), given DeepMind's track record. Also impressed with their research -> product integration of late.
Veo’s fluid dynamics and AlphaFold’s protein predictions suggest these models are picking up real patterns, even without direct supervision.
2. Memory Wall Is Real—but Cracking
Genie 3 struggles with long-term coherence. No argument there. But DeepMind’s quietly optimizing across training and architecture (e.g. AlphaEvolve). Video generation’s progress—from seconds to minutes—shows the wall’s eroding, bit by bit.
Still a journey, I guess. Destination unknown, but I'm hopeful.😊🤭
3. Modular vs. Omni-Model Isn’t Binary
Modular systems are easier to trust, yup. But Hassabis is betting on fusion—language, video, simulation in one model. Gemini’s tool use hints at a hybrid future: structured scaffolding with generative reasoning inside. Pieces of a complex puzzle (one day, maybe) with flawless integration, as jagged as the edges may presently appear.
4. AI Winter Risk Is Legit—but Context Matters
You flagged the risk of overpromising. Fair. But DeepMind’s track record (AlphaGo, AlphaFold, their whole Alpha model range) is built on solving real problems. Hassabis isn’t selling certainty, just possibility. That’s a key difference.
👀 Ummmm
Hallucinated worlds may not be fully ready for deployment—but they’re not just fantasy either. They’re starting to show internal logic. Maybe not 100% trustworthy yet, but definitely worth watching.
I also believe hallucination is a feature we shouldn't exclude completely - or we risk discarding novelty along with it. Just a bit more "fine tuning" required, and one day that balance could be productively applied...