Run multiple fine-tuned LLMs for the price of one
Fine-tuning is one of the most valuable use cases for open-source LLMs like Llama and Falcon. However, fine-tuning comes with heavy memory and compute overhead, especially if you want to maintain multiple models for different downstream tasks.
One solution is low-rank adaptation (LoRA), a technique that reduces the overhead of fine-tuning LLMs. And you can use LoRAX, a framework for serving many LoRA models at scale.
Key findings:
Classic deep learning fine-tuning permanently modifies the original model's weights, which becomes costly and complex when you need several specialized models
LoRA instead trains a separate adapter whose parameters are a tiny fraction of the original LLM's
The LoRA adapter can be plugged into the main LLM at inference time, leaving the original model's weights untouched
LoRAX is a framework by Predibase that lets you manage a large collection of LoRA adapters and swap between them at runtime for different applications
LoRAX includes advanced features, such as a multi-layered caching system that manages GPU memory when you serve many LoRA adapters
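To make the idea behind LoRA concrete, here is a minimal NumPy sketch (not the LoRAX or PEFT API; the shapes and names are illustrative assumptions). The frozen weight matrix W stays untouched while two small low-rank matrices A and B form the trainable adapter, which can be merged into W at inference time:

```python
# Minimal LoRA sketch with NumPy. Hypothetical shapes for illustration;
# real libraries (e.g. Hugging Face PEFT, LoRAX) handle this per layer.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 16, 16, 2   # r << d_in: the low-rank bottleneck

# Frozen pretrained weight matrix: never updated during fine-tuning
W = rng.standard_normal((d_in, d_out))

# Trainable LoRA factors: only d_in*r + r*d_out parameters
A = rng.standard_normal((d_in, r)) * 0.01
B = np.zeros((r, d_out))     # B starts at zero, so the adapter is initially a no-op

def forward(x):
    # Output = frozen path + low-rank adapter path
    return x @ W + x @ A @ B

x = rng.standard_normal((1, d_in))

# With B = 0, the adapted model matches the original model exactly
assert np.allclose(forward(x), x @ W)

# At inference time the adapter can be merged into W (or left out entirely),
# which is why the original model is preserved
W_merged = W + A @ B
assert np.allclose(x @ W_merged, forward(x))

print(f"LoRA trains {A.size + B.size} params vs {W.size} in the full layer")
```

In this toy layer the adapter holds 64 parameters versus 256 in the frozen matrix; at LLM scale the same ratio is what lets one GPU hold a base model plus many adapters.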
Read the full article on TechTalks.
Recommendations:
My go-to platform for working with ChatGPT, GPT-4, and Claude is ForeFront.ai, which has a super-flexible pricing plan and plenty of good features for writing and coding.
Transformers for Natural Language Processing is an excellent introduction to the technology underlying LLMs. It provides a very accessible explanation of how transformers work and how you can use different transformer architectures (BERT, T5, GPT, etc.).
More tips and tricks with LLMs: