Lessons learned from creating LLMs with million-token context windows
Gradient and Crusoe customized Llama-3 to support a 1-million-token context window. Here's how they did it.
Shortly after Meta released Llama-3, enterprise AI startup Gradient created a version of the LLM with a million-token context. (The base Llama-3 model has an 8k context window.)
I recently spoke to a team from Gradient and Crusoe, the AI cloud platform that provided the compute resources, about how they created the models and the lessons they learned from the effort.
Some key takeaways:
1- Open research is vital to advances in AI: The team built on research, techniques, models, and code published by researchers across the world, including BAIR (distributed attention), Meta (Llama-3), Nvidia (RULER), and universities in China and Singapore.
“A lot of it wouldn’t have been possible without the open research community,” Leo Pekelis, Chief Scientist at Gradient, said. “Open research influences our work across the stack.”
Frontier AI labs and big tech companies are increasingly choosing not to share their research and findings. Hopefully, the open research community will continue to ensure that everyone has access to cutting-edge research.
2- You don't necessarily need the most expensive GPUs to conduct advanced AI work: The team trained its models, based on Llama-3 8B and 70B, on a cluster of L40S GPUs at a fraction of the cost of higher-end GPUs. This was possible because the Crusoe and Gradient teams worked closely to adjust the GPU cluster for the specific kinds of computations required.
I think providing customized GPU stacks on the cloud can turn into a niche market that competes with large cloud providers. Small cloud providers can work with enterprise customers to customize their stacks for efficiency and cost reduction.
“The way that we work with partners like Gradient is just to understand where we can provide the most efficient compute across the different types based on what they’re doing. And in this case, L40S was the right answer,” Patrick McGregor, Chief Product Officer at Crusoe, said. “We can provide a huge amount of value in customizing or tailoring different types of compute offerings.”
3- Long-context LLMs unlock new applications and opportunities: Whether it's style transfer or adapting models to new tasks, long-context LLMs will enable companies to quickly create prototypes and full applications without going through the technical complications of techniques such as fine-tuning and retrieval-augmented generation (RAG). And you can do it without sending your data to GPT-4, Claude, or Gemini.
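To make that contrast concrete, here is a minimal sketch of the "put the whole document in the prompt" pattern that a million-token window allows, instead of building a retrieval pipeline. It assumes the long-context model is served locally behind an OpenAI-compatible endpoint; the endpoint URL, the file name, and the model identifier are illustrative placeholders, not details from the interview.

```python
# Sketch: query a locally hosted long-context model with an entire document in the prompt.
# Assumes an OpenAI-compatible server (e.g., a local inference server) at the URL below.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

# With a 1M-token window, a whole report (or a small codebase) can go straight
# into the prompt -- no chunking, embedding, or retrieval step required.
document = open("annual_report.txt").read()  # placeholder file

response = client.chat.completions.create(
    model="gradientai/Llama-3-8B-Instruct-Gradient-1048k",  # example long-context checkpoint
    messages=[
        {"role": "system", "content": "Answer questions using only the provided document."},
        {"role": "user", "content": f"{document}\n\nQuestion: What were the main risk factors?"},
    ],
)
print(response.choices[0].message.content)
```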
Read the full interview on VentureBeat.
When running large model inference, we need to focus on three aspects:
1. Memory (VRAM): The amount of memory required is determined mainly by the number of model parameters and the precision they are stored in.
2. Bandwidth: Large model inference is memory-bound; it repeatedly reads the model weights from VRAM, so memory bandwidth directly affects inference speed.
3. Quantization: Many models now ship quantized versions in addition to the standard FP16 checkpoints. Lower-precision quantization saves VRAM and reduces the amount of data that must be read from memory per token, making it a common inference optimization.
Cloud-based inference optimization also revolves around these three aspects.
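A rough back-of-the-envelope sketch ties these three aspects together: weight memory is parameter count times bytes per parameter, and in the memory-bound regime single-stream decoding speed is capped by how fast those weights can be streamed from VRAM. The parameter count and the L40S bandwidth figure below are assumptions for illustration, and the estimate ignores the KV cache, activations, and batching.

```python
# Back-of-the-envelope estimate, not a benchmark.
# Assumptions: ~8B parameters (Llama-3 8B) and ~864 GB/s memory bandwidth (L40S spec).

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GB."""
    return num_params * bytes_per_param / 1e9

def max_tokens_per_sec(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Memory-bound ceiling: each decoded token must stream all weights from VRAM."""
    return bandwidth_gb_s / weight_gb

params_8b = 8e9          # approximate parameter count (assumption)
l40s_bandwidth = 864.0   # GB/s (assumption based on the published L40S spec)

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    mem = weight_memory_gb(params_8b, bytes_per_param)
    tps = max_tokens_per_sec(mem, l40s_bandwidth)
    print(f"{name}: ~{mem:.1f} GB of weights, <= ~{tps:.0f} tokens/s per GPU (single stream)")
```

The output makes the point of item 3 visible: halving the precision halves both the VRAM needed for the weights and the data read per token, which is why quantization improves both memory footprint and throughput.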