How ModernBERT is revolutionizing encoder models for LLM applications
ModernBERT combines the strengths of encoder models with the latest techniques for making transformers more efficient.
With all the excitement around decoder-based large language models (aka “autoregressive models” or “GPT-style LLMs”), encoder-based Transformers have not received the attention they deserve. Now, ModernBERT, a new encoder model developed by Answer.AI and LightOn, is helping encoders catch up with advances in other LLMs.
ModernBERT integrates some of the research that has gone into improving LLMs to create a very fast encoder with a much larger context window and more learning capacity than existing models. It can be a powerful tool for optimizing LLM applications that currently rely on slow and costly autoregressive models.
Why encoder models?
Like decoders, encoder models take in a sequence of tokens. But instead of generating new tokens, they produce an embedding vector, a numerical representation of the sequence shaped by the task the model was trained for. For example, in retrieval-augmented generation (RAG), encoders create embeddings that capture the semantic content of documents. These embeddings are then used in the RAG pipeline to match prompts to relevant documents. In classification tasks, the same kind of embedding can be used, for example, to detect whether a prompt or response contains harmful content.
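To make this concrete, here is a minimal sketch of how an encoder turns text into embeddings with the Hugging Face Transformers library. The mean-pooling step and the generic bert-base-uncased checkpoint are illustrative choices, not a prescription.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Return one vector per input text by mean-pooling the token embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)           # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (batch, dim)

doc_vectors = embed(["ModernBERT is an encoder model.",
                     "Decoders generate text token by token."])
print(doc_vectors.shape)  # torch.Size([2, 768])
```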
Decoder models can also perform many of these tasks. In fact, developers are building applications on top of decoder-based LLMs such as GPT-4o, Claude, Gemini, and Llama. These models have proven to be very good at various tasks, including generation (text and code), classification, sentiment analysis, and question-answering.
However, they are not the best models for all of these tasks. First, they are compute-intensive, expensive, and slow. Encoder models, by contrast, are usually orders of magnitude smaller and much faster than decoders.
Second, since decoders are designed to generate tokens, their attention can only look backward at the tokens that came before. Encoder models like BERT, in contrast, are bidirectional: they examine the relationships between tokens in both directions. This lets them learn richer representations, which is why they perform much better on embedding-based tasks at smaller sizes.
For non-generation tasks, such as classification and similarity measurement, encoder models are the better fit. An off-the-shelf or API-based LLM can help you build a prototype, but when it comes to scaling, encoder models can speed up execution and slash costs at many steps of your prompt pipeline, as the sketch below illustrates.
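The snippet below trains a lightweight classifier on top of encoder embeddings, the kind of cheap step that can replace an LLM call in a pipeline. It assumes the embed() helper from the earlier sketch; the toy texts, labels, and the scikit-learn logistic regression are all illustrative.

```python
from sklearn.linear_model import LogisticRegression

# Toy labeled data; 0 = benign, 1 = harmful. A real pipeline would use a proper dataset.
train_texts = ["how do I reset my password",
               "step-by-step instructions for picking a lock"]
train_labels = [0, 1]

# Fit a small classifier on the encoder embeddings and score a new prompt.
clf = LogisticRegression().fit(embed(train_texts).numpy(), train_labels)
print(clf.predict(embed(["where can I change my email address"]).numpy()))
```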
ModernBERT
Unfortunately, encoder models have seen little of the recent innovation poured into decoders, even though they are an important part of the LLM application ecosystem. ModernBERT fills this gap by bringing some of the features of bleeding-edge LLMs to encoder models.
Classic BERT models usually have a 512-token context window; ModernBERT extends it to 8,192 tokens. This makes it possible to work with much longer documents when building RAG pipelines. ModernBERT's training data also includes a large amount of code, making it well suited to programming-assistance tasks such as code search. It scores above all other open-source encoders on the StackOverflow-QA (SQA) benchmark, which mixes code and natural language.
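As a rough sketch of what embedding-based code search looks like, the snippet below reuses the hypothetical embed() helper from above to rank code snippets against a natural-language query. The snippets and the cosine-similarity ranking are illustrative, not ModernBERT-specific code.

```python
import torch.nn.functional as F

code_snippets = [
    "def binary_search(arr, target): ...",
    "def quicksort(arr): ...",
]
query = "find an element in a sorted list"

snippet_vecs = embed(code_snippets)                      # (2, dim)
query_vec = embed([query])                               # (1, dim)
scores = F.cosine_similarity(query_vec, snippet_vecs)    # one score per snippet
print(code_snippets[int(scores.argmax())])               # best-matching snippet
```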
Regarding performance, ModernBERT pushes the Pareto frontier of encoder models, offering the best combination of accuracy, speed, and memory efficiency. On important encoder-based tasks, ModernBERT is a top scorer across every category. For example, on the GLUE benchmark it beats the leading model, DeBERTaV3, while using only a fifth of the memory. ModernBERT is also twice as fast as DeBERTa in general, and up to four times faster in the common situation where inputs come in mixed lengths.
ModernBERT has been designed to work on smaller and more affordable GPUs while handling large batch sizes. This makes it useful for edge applications running on laptops and phones.
Architecture changes
To achieve these improvements, the creators of ModernBERT brought some of the techniques developed for LLMs into the encoder architecture. For example, they replaced the classic absolute position embeddings with rotary position embeddings (RoPE), which encode a token's position by rotating its query and key vectors. This lets the model handle much longer sequences while preserving the relative positions of tokens.
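RoPE is applied to the query and key vectors inside each attention layer. The sketch below shows the rotation itself in the "rotate-half" convention used by many open implementations; it is illustrative, not ModernBERT's actual code.

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate pairs of feature dimensions by a position-dependent angle.
    x has shape (batch, seq_len, dim) with an even dim."""
    _, seq_len, dim = x.shape
    half = dim // 2
    # One frequency per dimension pair; lower dimensions rotate faster.
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 8, 64)      # a toy batch of query vectors
print(apply_rope(q).shape)     # torch.Size([1, 8, 64])
```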
They also made changes to the layers, adding normalization layers, removing unneeded biases, and replacing the standard MLP layers with GeGLU layers (GeLU-based gated linear units), which have been shown to improve Transformer performance. ModernBERT also updates the encoder architecture to support FlashAttention 2, making it faster and more efficient on modern GPUs.
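For intuition, here is a minimal sketch of a GeGLU feed-forward block. The layer names, hidden size, and the bias-free linear layers are illustrative choices, not the exact ModernBERT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    """Gated GeLU feed-forward block: down(gelu(gate(x)) * up(x))."""
    def __init__(self, dim, hidden_dim, bias=False):  # bias=False mirrors the "remove unneeded biases" change
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=bias)
        self.up = nn.Linear(dim, hidden_dim, bias=bias)
        self.down = nn.Linear(hidden_dim, dim, bias=bias)

    def forward(self, x):
        return self.down(F.gelu(self.gate(x)) * self.up(x))

block = GeGLU(dim=768, hidden_dim=2048)
print(block(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```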
They also modified the attention mechanism to alternate between global and local attention across layers. Most layers only attend to a local window around each token, which keeps long sequences cheap to process, while the remaining global-attention layers integrate important information from the entire sequence.
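Conceptually, the difference between the two patterns comes down to the mask each layer applies before attention. The sketch below builds such masks for a hypothetical schedule; the window size and the every-third-layer cadence are assumptions for illustration.

```python
import torch

def layer_attention_mask(seq_len, layer_idx, window=128, global_every=3):
    """True = this token pair may attend. Every `global_every`-th layer attends
    globally; the other layers see only a local sliding window (assumed values)."""
    if layer_idx % global_every == 0:
        return torch.ones(seq_len, seq_len, dtype=torch.bool)   # full attention
    pos = torch.arange(seq_len)
    return (pos[:, None] - pos[None, :]).abs() <= window // 2   # sliding window

# A local layer on a 1,024-token input only attends to a small fraction of pairs.
print(layer_attention_mask(1024, layer_idx=1).float().mean())
```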
ModernBERT comes in a base version (149 million parameters) and a large version (395 million parameters), and will be integrated into future versions of the Transformers library.
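Once that support lands, loading the model should follow the standard pipeline API. The checkpoint name below is the one the authors have published on the Hugging Face Hub; treat the snippet as a sketch rather than official usage.

```python
from transformers import pipeline

# Masked-language-model inference with the published base checkpoint (assumed name).
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
print(fill_mask("Paris is the [MASK] of France.")[0]["token_str"])
```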