Meta's Chameleon uses new approach to create multimodal LLMs
Chameleon uses "early fusion" to improve multimodality and unlock new applications
Meta has just announced Chameleon, its new multimodal language model. Chameleon's key feature is "early fusion." Most multimodal LLMs have separate components to process each modality (text, image, code, etc.). They encode those inputs independently and then concatenate the encodings before running inference on them, an approach also called "late fusion."
In contrast, Chameleon uses an end-to-end approach that tokenizes text, code, images, and other modalities into a single shared sequence, encodes them together, performs inference, and returns a mixed-modality response. This unified architecture enables Chameleon to learn richer representations than other models.
For example, if you have a document with text and images interleaved, the position of the text and images makes a huge difference in their meaning and the model's response. A classic late-fusion model would not pick up on these nuances, while an early-fusion model would be able to learn the right representations.
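To make the distinction concrete, here is a minimal sketch (not Meta's code) that contrasts the two approaches on an interleaved document. The tokenizer functions and vocabulary offsets are hypothetical placeholders; the point is only that early fusion preserves the original ordering of text and image segments in one shared token sequence, while late fusion encodes each modality separately and loses that ordering.

```python
def tokenize_text(text):
    # Hypothetical text tokenizer: one token id per word.
    return [hash(w) % 50_000 for w in text.split()]

def tokenize_image(image_patches):
    # Hypothetical image tokenizer (e.g., a discrete codebook): one id per patch,
    # mapped into a separate region of the shared vocabulary.
    return [50_000 + (p % 8_192) for p in image_patches]

document = [
    ("text", "The chart below shows revenue by quarter"),
    ("image", [17, 245, 989, 1024]),  # stand-in for image patch codes
    ("text", "Note the dip in Q3"),
]

# Late fusion: each modality is encoded on its own, then concatenated.
# The interleaved order between text and image segments is lost.
late_text = [t for kind, x in document if kind == "text" for t in tokenize_text(x)]
late_image = [t for kind, x in document if kind == "image" for t in tokenize_image(x)]
late_fusion_input = late_text + late_image

# Early fusion: every segment is mapped into one shared token sequence in its
# original position, so the model can attend across modalities and learn how
# placement affects meaning.
early_fusion_input = []
for kind, x in document:
    early_fusion_input += tokenize_text(x) if kind == "text" else tokenize_image(x)

print(len(late_fusion_input), len(early_fusion_input))  # same tokens, different order
```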
Meta has not yet released the model weights, but it has reportedly trained 7-billion- and 34-billion-parameter versions on around 4.4 trillion tokens (2.1 epochs), using more than 5 million GPU hours.
Meta's reported experiments show that Chameleon has SOTA performance on tasks such as image captioning and visual question-answering (VQA). At the same time, it maintains competitive performance on text-only tasks. Moreover, Chameleon unlocks new capabilities for applications that require the processing and generation of interleaved data modalities.
Early fusion can set new directions for future AI research and potentially create new consumer and enterprise applications. It will also be interesting to see how Chameleon, if Meta releases it, will affect the competitive LLM market, where competition has shifted to seamless multimodal experiences. An open model could commoditize the market and make multimodal LLMs available to more organizations.
Read the paper here.
Read my review on VentureBeat here.