Learning commonsense physics through self-supervised learning
Recent research from Meta AI shows that ML models can learn intuitive physics simply by watching videos.
Humans have an innate understanding of how the world works. We expect a dropped ball to fall, objects to persist even when hidden, and solid things not to pass through each other. This "intuitive physics" is fundamental to our cognition.
Yet, replicating this common sense in artificial intelligence remains a significant challenge. Now, a recent study by researchers at Meta AI demonstrates how a specific type of deep learning model can develop an understanding of intuitive physics simply by watching vast amounts of unlabeled video data.
This work offers valuable insights into building better world models, a crucial step towards more capable and general-purpose AI.
Intuitive physics and the AI challenge
Intuitive physics is our basic grasp of how the physical world works. We expect objects to behave predictably—they don't suddenly appear or disappear, move through solid barriers, or arbitrarily change their shape or color. This understanding develops early in humans and even exists in many animal species.
Despite rapid advances on complex tasks such as coding, mathematics, and language generation, current AI systems struggle with common-sense physical reasoning. This illustrates a persistent gap often referred to as “Moravec’s paradox”: tasks trivial for biological organisms can be remarkably difficult for AI.
There are two main approaches to imbuing AI with physical understanding. Structured models often use hand-coded representations of objects, their properties, and their relationships within a 3D space, essentially building a "game engine" in the AI's mind to simulate physics. This aligns somewhat with theories suggesting humans have innate "core knowledge" systems. On the opposite end are pixel-based generative models. These systems take a more general approach, learning by trying to predict future video frames directly at the pixel level based on past frames, without any pre-defined structure about objects or physics.
V-JEPA: A middle ground for learning physics
The Meta AI paper explores a third approach that finds a middle ground: Joint Embedding Predictive Architectures (JEPAs). JEPA was first introduced in 2022 by Meta’s chief AI scientist Yann LeCun (also a co-author in the new paper). The core idea behind JEPAs is that predicting future world states should happen in an abstract, internal representation learned by the model itself, rather than directly predicting low-level features or relying on hand-coded structures. Unlike structured models, JEPAs learn their own representations from data.
The study focuses on a video version of this architecture, called V-JEPA. This model learns about the world by watching videos and predicting missing parts. Crucially, instead of predicting scenes at the pixel level, V-JEPA makes its predictions within a learned abstract representation space that captures higher-level properties, such as how an object should interact with its environment and other objects.
At a high level, V-JEPA consists of two main components: an encoder and a predictor. The encoder analyzes a video and extracts abstract representations of its content. During training, parts of the input video are artificially masked (e.g., random blocks in space and time, or future frames). The predictor's job is then to predict the representation of these missing parts, based on the visible parts provided by the encoder.
Through this process, the encoder learns to capture the essential, predictable information about the video's content and dynamics, while discarding irrelevant low-level details.
The main benefit of this training method is that it is self-supervised, which means that it does not require humans to label the video frames.
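To make the encoder-predictor setup concrete, here is a minimal PyTorch sketch of a JEPA-style training step. The module names, dimensions, pooling over patches, and L1 loss are illustrative assumptions, not Meta's actual V-JEPA implementation; the point is only that both the prediction target and the loss live in representation space rather than pixel space.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps video patch features to abstract representations (toy stand-in)."""
    def __init__(self, patch_dim=768, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, patches):          # (batch, num_patches, patch_dim)
        return self.net(patches)         # (batch, num_patches, embed_dim)

class Predictor(nn.Module):
    """Predicts the representation of hidden content from visible content."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, context):
        return self.net(context)

def jepa_training_step(encoder, target_encoder, predictor, patches, mask):
    """One self-supervised step: predict representations of masked patches.

    patches: (batch, num_patches, patch_dim) precomputed video patch features
    mask:    (num_patches,) boolean, True where a patch is hidden
    """
    # Targets are representations of the masked patches (no gradient), so the
    # loss is defined in representation space, never at the pixel level.
    with torch.no_grad():
        target = target_encoder(patches)[:, mask].mean(dim=1)

    # The context comes from the visible patches only.
    context = encoder(patches[:, ~mask]).mean(dim=1)

    predicted = predictor(context)
    return nn.functional.l1_loss(predicted, target)

# Toy usage: hide half the patches of a dummy clip and take one gradient step.
patches = torch.randn(4, 196, 768)
mask = torch.rand(196) < 0.5
encoder, target_encoder, predictor = Encoder(), Encoder(), Predictor()
loss = jepa_training_step(encoder, target_encoder, predictor, patches, mask)
loss.backward()
```

In the real model both components are transformers and the prediction targets typically come from a separate copy of the encoder; the sketch above only preserves the structural idea of predicting abstract representations of hidden content from visible content.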
Probing V-JEPA's understanding of the world
Once V-JEPA is trained on large amounts of video data, its learned encoder and predictor can be used to probe its understanding of physics without any further training or fine-tuning.
The researchers used a method inspired by developmental psychology called the "violation-of-expectation" paradigm. In human infant studies, researchers show babies two scenarios: one physically plausible and one impossible (e.g., an object seemingly passing through a solid wall). Increased looking time at the impossible event is interpreted as "surprise," indicating the infant understands the physical principle being violated.
Similarly, the AI model can be shown pairs of videos – one physically possible, one impossible. As the paper states: "By prompting the model to imagine the (representation of the) future of a video and comparing its predictions with the actual observed future of the video, we obtain a quantitative measure of surprise that can be used to detect violations of intuitive physics concepts."
A higher surprise score for the impossible video indicates the model has learned the relevant physical principle.
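As a rough illustration of how such a surprise score could be computed with a trained model, the sketch below reuses the toy Encoder and Predictor from the earlier snippet and measures surprise as a simple L1 distance between the predicted and observed future representations; the actual metric and pooling used in the paper may differ.

```python
import torch

def surprise_score(encoder, predictor, context_patches, future_patches):
    """Distance between the predicted and the actually observed future.

    context_patches: patch features for the observed start of the video
    future_patches:  patch features for the frames that actually follow
    Higher values mean the observed future deviates more from the model's
    expectation, i.e. the model is more "surprised".
    """
    with torch.no_grad():
        predicted_future = predictor(encoder(context_patches).mean(dim=1))
        observed_future = encoder(future_patches).mean(dim=1)
    return torch.nn.functional.l1_loss(predicted_future, observed_future).item()

# Violation-of-expectation test on a matched pair of clips: the model "passes"
# if the physically impossible clip receives the higher surprise score.
# passed = surprise_score(encoder, predictor, ctx_impossible, fut_impossible) > \
#          surprise_score(encoder, predictor, ctx_possible, fut_possible)
```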
How V-JEPA performs compared to other models
The researchers tested V-JEPA's intuitive physics understanding using three benchmark datasets that include videos designed to test specific concepts like object permanence (objects continue to exist when hidden), continuity (objects move along connected paths), shape and color constancy, solidity (objects don't pass through each other), gravity, support, and inertia.
They compared V-JEPA against other classes of models: a representative pixel-prediction model (VideoMAEv2) and state-of-the-art multimodal large language models (MLLMs such as Qwen2-VL and Gemini 1.5 Pro) that reason about videos through text.
The results were striking. V-JEPA consistently and accurately distinguished between physically plausible and implausible videos across all datasets, achieving high accuracy (e.g., 98% on IntPhys). In contrast, both the pixel-prediction model and the MLLMs performed much closer to random chance.
"These results show that prediction in a learned representation space is sufficient to develop an understanding of intuitive physics," the authors conclude. "This is done without any predefined abstractions, and without knowledge of the benchmarks during pretraining or development of the method."
The researchers stress that these findings "do not mean that LLMs or pixel prediction models cannot achieve intuitive physics understanding, but merely that this seemingly simple task remains difficult even for frontier models."
Why V-JEPA succeeds
The study explores how different design choices affect V-JEPA’s grasp of intuitive physics.
The researchers found that the specific masking strategy during training wasn't the most critical factor. Even simple random masking worked reasonably well. The key element seems to be performing the prediction task within an abstract representation space, rather than predicting raw pixels.
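For intuition, random masking can be as simple as hiding a fixed fraction of spatiotemporal patch positions. The sketch below (with an arbitrary 75% ratio and grid size, not the paper's settings) shows one way such a mask could be built.

```python
import torch

def random_patch_mask(num_frames, height, width, mask_ratio=0.75):
    """Hide a random fraction of spatiotemporal patch positions.

    Returns a boolean tensor of shape (num_frames, height, width); True marks
    patches the predictor must infer from the visible remainder.
    """
    n_total = num_frames * height * width
    n_masked = int(mask_ratio * n_total)
    mask = torch.zeros(n_total, dtype=torch.bool)
    mask[torch.randperm(n_total)[:n_masked]] = True
    return mask.reshape(num_frames, height, width)

# Example: mask 75% of an 8-frame, 14x14 patch grid at random.
mask = random_patch_mask(8, 14, 14)
```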
In terms of data, the type of video used for training matters. Models trained on datasets focused primarily on motion performed poorly. Training on action-centric datasets yielded above-chance results. The best performance came from training on tutorial videos, even when using only a small fraction of the full dataset (e.g., 128 hours of unique video, roughly one week of continuous footage).
Model size was also important, and as is common in deep learning, larger models generally performed better. However, the ability to learn intuitive physics wasn't exclusive to massive models. A relatively small V-JEPA model (115 million parameters) still achieved impressive accuracy (over 85%), demonstrating the robustness of the approach.
Limitations and the path forward
Despite its success, V-JEPA isn't perfect. It struggles with physics concepts that require understanding a specific contextualizing event shown earlier (like knowing if a container has a false bottom before seeing an object dropped into it) or modeling precise interactions like collisions. Current models also lack the ability to condition their predictions on external factors, like an ongoing action. They predict the future purely as passive observers.
Future research could explore training these models on video data specifically curated to mimic what human infants see, potentially shedding light on how early visual experiences shape physical understanding.
The researchers are optimistic about the approach and write, "We believe that the latent prediction framework is a path forward toward building neural networks that understand the physical world."