Custom embedding models with a small training dataset
Embedding models are a key component of many LLM applications, including retrieval-augmented generation (RAG). However, some applications need a custom embedding model to get good data representations that match user prompts to relevant documents.
Training embedding models is currently a complex and expensive two-stage process. First, the model is pre-trained on a large, weakly labeled dataset through contrastive learning. It is then fine-tuned on a small but high-quality labeled dataset.
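To make the contrastive stage concrete, here is a minimal sketch of an InfoNCE-style objective of the kind typically used in that pre-training step. The function name, temperature value, and toy tensors are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a contrastive (InfoNCE-style) loss over a batch of
# (query, document) pairs. Illustrative only; not the paper's exact setup.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Each query's positive is the document at the same batch index;
    every other document in the batch serves as an in-batch negative."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # Cosine similarity between every query and every document.
    logits = query_emb @ doc_emb.T / temperature  # shape: (batch, batch)
    # The matching document sits on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage: random tensors stand in for encoder outputs.
queries = torch.randn(8, 768)
documents = torch.randn(8, 768)
print(info_nce_loss(queries, documents))
```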
A new technique from Microsoft researchers offers a solution that is both simpler and less expensive. The researchers use proprietary LLMs such as GPT-4 to generate training examples for retrieval tasks and other embedding applications. They then use this synthetic data to fine-tune an autoregressive model rather than a bidirectional encoder, which is the norm.
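Below is a hedged sketch of the two ideas: prompting a proprietary LLM to synthesize labeled retrieval examples, and pulling embeddings out of a decoder-only model with last-token pooling. The prompt wording, the stand-in gpt2 checkpoint, and the pooling choice are my assumptions for illustration; the paper's exact recipe differs.

```python
# Sketch of (1) a data-synthesis prompt for a proprietary LLM and
# (2) last-token pooling on an autoregressive (decoder-only) model.
import torch
from transformers import AutoModel, AutoTokenizer

# In practice, a prompt like this would be sent to GPT-4 many times
# (e.g., via the OpenAI API) to build the fine-tuning dataset.
SYNTH_PROMPT = (
    "Generate a JSON object with fields 'query', 'positive', and "
    "'negative' for a web-search retrieval task. 'positive' must answer "
    "the query; 'negative' must be topically similar but not answer it."
)

# gpt2 is a small stand-in for the larger decoder-only model used in the paper.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def embed(texts: list[str]) -> torch.Tensor:
    """Embed texts with a decoder-only LM using last-token pooling."""
    tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)
    # Index of the last real (non-padding) token in each sequence.
    last = batch["attention_mask"].sum(dim=1) - 1
    return hidden[torch.arange(hidden.size(0)), last]

print(embed(["what is a vector database?",
             "A vector database stores embeddings for similarity search."]).shape)
```

Last-token pooling is a natural fit for autoregressive models: because of causal attention, only the final position has seen the entire input.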
The result is an embedding model that skips the expensive pre-training stage and only needs fine-tuning on the small dataset. This makes it cost-effective and feasible for many companies to create their own custom embedding models.
To learn more about the technique, read the article on TechTalks.
Recommendations:
To test different prompts with GPT-3.5/4 and Claude, I use ForeFront, a platform that provides access to state-of-the-art models with very flexible pricing. ForeFront also has a Workflow feature that lets you create no-code, multi-step logic with your models. I used it to try the prompt-creation pipeline presented in this article.
For more on AI research: