New BitNet architecture makes 1-bit LLMs faster, more efficient
The key is to find the right balance between sparsification and quantization
Microsoft Research has announced BitNet a4.8, the successor to its highly touted BitNet model. BitNet is a 1-bit LLM, an architecture that dramatically reduces the memory and computational resources required to run large language models.
Previous BitNet models used ternary weights (-1, 0, 1), equivalent to roughly 1.58 bits per weight, together with 8-bit activations. This approach significantly reduced memory and I/O costs, but the computational cost of matrix multiplications remained a bottleneck.
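To illustrate, here is a minimal PyTorch sketch of that scheme, assuming the absmean weight quantizer described in the earlier BitNet b1.58 work and a standard per-token absmax quantizer for the 8-bit activations (function names are illustrative, not from the paper):

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    # Absmean scaling: divide by the mean |weight|, then round each
    # entry to the nearest value in {-1, 0, 1}.
    scale = w.abs().mean() + eps
    w_q = (w / scale).round().clamp_(-1, 1)
    return w_q, scale

def quantize_act_int8(x: torch.Tensor, eps: float = 1e-5):
    # Symmetric per-token absmax quantization to 8-bit integers.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / 127.0
    x_q = (x / scale).round().clamp_(-128, 127)
    return x_q, scale
```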
BitNet a4.8 further optimizes 1-bit LLMs through what the researchers describe as “hybrid quantization and sparsification.” They designed an architecture that selectively applies quantization or sparsification to different components of the model based on the specific distribution pattern of activations. Sparsification reduces the number of computations by pruning activations with small magnitudes; quantization represents each activation with fewer bits, trading some precision for lower memory and compute costs.
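One simple way to realize magnitude-based sparsification is a per-token top-k filter; the sketch below shows the general idea, not necessarily the paper's exact selection rule:

```python
import torch

def sparsify_topk(x: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the k largest-magnitude activations per token and zero the
    # rest, so downstream matmuls can skip the pruned entries.
    thresh = x.abs().topk(k, dim=-1).values[..., -1:]
    return x * (x.abs() >= thresh)
```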
BitNet a4.8 uses 4-bit activations for the inputs to attention and feed-forward network (FFN) layers, and combines sparsification with 8-bit quantization for intermediate states. The architecture is also optimized to take advantage of existing hardware.
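A 4-bit quantizer for those layer inputs might look like the following, again a per-token absmax sketch rather than the paper's actual kernel; intermediate states would instead flow through the sparsify-then-8-bit path sketched above:

```python
import torch

def quantize_act_int4(x: torch.Tensor, eps: float = 1e-5):
    # Per-token absmax quantization to signed 4-bit integers in [-8, 7],
    # applied here to the inputs of attention and FFN layers.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / 7.0
    return (x / scale).round().clamp_(-8, 7), scale
```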
BitNet a4.8 also uses 3-bit values to represent the key-value (KV) cache, a crucial component of transformer models that stores the representations of previous tokens in the sequence.
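A toy version of low-bit KV caching, assuming a per-head absmax scale (the paper's exact quantizer may differ), would quantize on write and dequantize on read:

```python
import torch

def quantize_kv_3bit(kv: torch.Tensor, eps: float = 1e-5):
    # Quantize cached keys/values to signed 3-bit integers in [-4, 3].
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / 3.0
    return (kv / scale).round().clamp_(-4, 3), scale

def dequantize_kv(kv_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover approximate full-precision values before attention.
    return kv_q * scale
```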
Compared to full-precision Llama models, BitNet a4.8 reduces memory usage by a factor of 10 and achieves a 4x speedup. Compared to BitNet b1.58, it achieves a 2x speedup through 4-bit activation kernels. But the design could deliver even more if implemented on specialized hardware.
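A back-of-the-envelope check makes the weight-memory figure plausible (this ignores activations, the KV cache, and packing overhead):

```python
fp16_bits_per_weight = 16
ternary_bits_per_weight = 1.58  # log2(3) bits to encode {-1, 0, 1}
print(f"~{fp16_bits_per_weight / ternary_bits_per_weight:.1f}x smaller")
# -> ~10.1x, in line with the reported factor-of-10 reduction
```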
BitNet a4.8 can be an important tool for edge AI applications that require privacy and real-time inference.
Read the paper on arXiv
Read more about BitNet a4.8 and comments from co-author Furu Wei on VentureBeat