Transfusion models handle both discrete and continuous data
Meta's Transfusion architecture natively supports images and text without the tradeoffs of other models.
Meta has introduced Transfusion, a new multi-modal architecture. It is not the first multi-modal model, but what sets it apart is its native support for both discrete (e.g., text) and continuous (e.g., images) data.
Other multi-modal models combine modalities using workarounds that degrade how well the data is represented. For example, models such as LLaVA process images and text with separate components. This separation makes it hard for the model to learn rich joint representations, such as when images and text are interleaved.
Other models such as Meta’s Chameleon support multi-modality natively. But to unify different modalities, they quantize image chunks into discrete tokens, resulting in information loss.
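To make the information loss concrete, here is a minimal sketch of nearest-neighbor quantization. It is not Chameleon's actual tokenizer: the codebook, the vector sizes, and the function names `quantize` and `dequantize` are all assumptions chosen for illustration. Each continuous patch vector is replaced by the index of its closest codebook entry, and mapping the index back does not recover the original vector.

```python
import numpy as np

def quantize(patches, codebook):
    """Map each continuous patch vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every patch and every codebook vector.
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def dequantize(tokens, codebook):
    """Look up the codebook vector for each discrete token."""
    return codebook[tokens]

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 4))    # 16 hypothetical patch vectors, 4 dims each
codebook = rng.normal(size=(8, 4))    # 8 hypothetical codebook entries

tokens = quantize(patches, codebook)          # discrete token ids in [0, 8)
reconstructed = dequantize(tokens, codebook)  # best the codebook can offer
error = float(np.mean((patches - reconstructed) ** 2))
print("tokens:", tokens)
print("mean squared reconstruction error:", error)
```

The reconstruction error is nonzero because 8 codebook entries cannot cover every possible continuous vector. Real tokenizers use much larger learned codebooks, which shrinks the error but does not remove it.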
Chunting Zhou, Senior Research Scientist at Meta AI and co-author of the Transfusion paper, previously worked on the Chameleon paper. I spoke to her for my latest article in VentureBeat. The idea for Transfusion grew out of the experience with Chameleon and its limitations.
Transfusion is a recipe for training a single model that can handle both discrete and continuous modalities without the need for quantization or separate modules. The core idea behind Transfusion is to train a single model with two objectives: language modeling for text and diffusion for images.
During training, the model is exposed to both text and image data, and the loss functions for language modeling and diffusion are applied simultaneously.
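As a rough sketch of what applying both losses at once can look like, the snippet below adds a next-token cross-entropy term for text positions to a noise-prediction mean-squared-error term for image patches. It assumes the common noise-prediction form of the diffusion objective, and the function names and the `image_weight` coefficient are hypothetical, not taken from the paper's code.

```python
import numpy as np

def language_modeling_loss(logits, targets):
    """Mean cross-entropy of next-token predictions over text positions."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def diffusion_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actual noise on image patches."""
    return float(np.mean((predicted_noise - true_noise) ** 2))

def combined_loss(logits, targets, predicted_noise, true_noise, image_weight=1.0):
    """Sum of both objectives; image_weight is a hypothetical balancing coefficient."""
    text_term = language_modeling_loss(logits, targets)
    image_term = diffusion_loss(predicted_noise, true_noise)
    return text_term + image_weight * image_term

# Toy batch: 3 text positions over a 5-token vocabulary, 2 image patches of 4 dims.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([0, 1, 2])
true_noise = rng.normal(size=(2, 4))
predicted_noise = true_noise + 0.1 * rng.normal(size=(2, 4))
print(combined_loss(logits, targets, predicted_noise, true_noise))
```

Because both terms are summed into one scalar, a single backward pass updates the shared model for both modalities.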
The model uses a variational autoencoder (VAE) to break each input image into 8x8 patches and represent each patch as a continuous vector.
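The patching step on its own can be sketched as follows. This covers only the split into non-overlapping 8x8 patches on raw pixels; the VAE that produces the continuous representation is not included, and the function name `split_into_patches` and the 32x32 toy image are assumptions for illustration.

```python
import numpy as np

def split_into_patches(image, patch_size=8):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Separate the row and column axes into (block index, offset inside block).
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Bring the two block indices to the front: (rows, cols, ph, pw, c).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in left-to-right, top-to-bottom order.
    return patches.reshape(rows * cols, patch_size * patch_size * c)

image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patches = split_into_patches(image)
print(patches.shape)  # (16, 192): 16 patches, each 8 * 8 * 3 values
```

Each row of the result is one patch as a continuous vector, so an image becomes a sequence that can sit next to text tokens in the same input.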
The researchers trained a 7-billion-parameter model based on Transfusion and evaluated it on a variety of standard uni-modal and cross-modal benchmarks, including text-to-text, text-to-image, and image-to-text tasks. Transfusion consistently outperformed Chameleon across all modalities while using less compute.
The 7B Transfusion model can also be trained to generate images at a quality that surpasses DALL-E 2 and SDXL.
The Transfusion architecture can unlock many new applications, according to the researchers.
Read more about Transfusion and comments from the author on VentureBeat
Read the paper on arXiv




