How LLMs can self-improve on instruction-following
A new research paper by Meta and New York University introduces “Self-Rewarding Language Models,” a technique that enables LLMs to self-improve on instruction-following tasks.
The main idea of SRLM is to enable the LLM to self-improve by creating and evaluating its own training data.
SRLM starts with a base model and a seed dataset for instruction fine-tuning. It then uses that dataset to generate new prompts and candidate responses. Finally, it uses a special LLM-as-a-Judge prompt to score and rank those responses.
The model uses the newly generated examples to augment its training dataset and then undergoes another round of training, which further improves its performance. This process can be repeated iteratively to continue improving the model.
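The loop described above can be sketched in a few lines. This is a minimal, hedged illustration of the control flow only: `generate` and `judge_score` are hypothetical stand-ins for sampling from the current model and for the LLM-as-a-Judge scoring prompt, stubbed here so the sketch runs; the paper's actual training step uses DPO on the resulting preference pairs.

```python
import random

def generate(model, prompt, n=4):
    """Sample n candidate responses from the current model (stubbed here)."""
    return [f"{model}: candidate {i} for '{prompt}'" for i in range(n)]

def judge_score(model, prompt, response):
    """Score a response (0-5) with an LLM-as-a-Judge prompt (stubbed here)."""
    rng = random.Random(hash((prompt, response)) % (2**32))
    return rng.randint(0, 5)

def self_reward_iteration(model, prompts):
    """One SRLM iteration: generate candidates, self-score them, and
    build (chosen, rejected) preference pairs for the next training round."""
    pairs = []
    for prompt in prompts:
        candidates = generate(model, prompt)
        # Rank candidates by the model's own judgment of their quality.
        ranked = sorted(candidates, key=lambda r: judge_score(model, prompt, r))
        # Best vs. worst response forms one preference pair.
        pairs.append({"prompt": prompt,
                      "chosen": ranked[-1],
                      "rejected": ranked[0]})
    return pairs  # these pairs would train the next model, M_{t+1}

pairs = self_reward_iteration("M1", ["Write a haiku about rain."])
print(len(pairs))
```

Each pass produces new preference data judged by the model itself, so the next training round needs no additional human annotation.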
Experiments show that SRLM improves the model at both instruction following and reward modeling. The resulting model outperforms some closed state-of-the-art models on the AlpacaEval benchmark.
Read all about Self-Rewarding Language Models on TechTalks.
Read the original paper here.
Recommendations:
To test different prompts with GPT-3.5/4 and Claude, I use ForeFront, a platform that provides access to state-of-the-art models in a very flexible pricing format.
LLM Cloud by Predera provides a full suite of tools for building applications with large language models. This includes a playground for testing and comparing different private models (GPT-4, Claude, Gemini, Command) and open-source models (Llama 2, Mistral, Mixtral). You also get a unified API platform that makes it easier to switch between models without changing your code, and a dashboard that gives a clear overview of the costs of each model in your application.
If, like me, your work involves a lot of reading, consider using Speechify, a tool that reads text for you out loud. It improves your concentration, reduces eye strain, and improves productivity. The voice quality is exceptional.
For more on AI research: