Why Microsoft's latest coding LLM could be a big deal
Train an LLM for coding tasks with only 20k examples
Microsoft researchers recently published a paper that introduces WaveCoder, a coding LLM that matches or outperforms other models of similar size.
What makes WaveCoder especially interesting, however, is the process used to train it. One of the key challenges in training coding LLMs is striking the right balance between training costs and model quality.
The researchers hypothesized that a small but diverse dataset can deliver the biggest bang for the buck. They created a pipeline that filters a very large dataset down to its most representative examples, then generates coding instruction training examples from them.
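To make the filtering idea concrete, here is a minimal sketch of one common way to select a diverse, representative subset from a large pool of examples: greedy k-center (coreset) selection over embeddings. This is an illustration of the general technique, not the paper's exact pipeline; the `embed` step is a placeholder for whatever code encoder you use.

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k points such that each new pick is the point
    farthest from everything selected so far, yielding a diverse subset."""
    selected = [0]  # seed with an arbitrary point
    # distance from every point to its nearest selected point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))  # farthest remaining point
        selected.append(idx)
        new_d = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dists = np.minimum(dists, new_d)  # refresh nearest-selected distances
    return selected

# Hypothetical usage: embed raw code snippets with any encoder,
# then keep a diverse subset to build instruction data from.
# embeddings = embed(code_snippets)            # placeholder encoder
# coreset_ids = k_center_greedy(embeddings, k=20_000)
```

The intuition is that maximizing coverage of the embedding space keeps the training set small without sacrificing the diversity that drives model quality.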
They created CodeOcean, a dataset with 20,000 training examples. Their experiments show that WaveCoder, trained on CodeOcean, outperforms other models of similar size, even those trained on larger datasets.
The significance of this technique is that it can considerably reduce the cost of training coding LLMs, making the process much more efficient.
Read all about WaveCoder and CodeOcean on TechTalks.
For more on AI research: