Although it has been late to the game, Apple has been making some interesting moves in the LLM space.
A recent paper by Apple researchers introduces a technique for running LLMs on edge devices with tight memory constraints.
The technique combines several novel memory management and weight storage methods to shrink a model's DRAM footprint: the full model is stored in flash memory, and only the parts needed at each step are dynamically loaded into DRAM.
However, flash throughput is at least an order of magnitude lower than DRAM's. To overcome this bottleneck, the researchers introduce several techniques that reduce inference latency by up to a factor of 25.
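The core idea of keeping weights in flash and pulling only the needed slices into DRAM can be sketched with a memory-mapped file standing in for flash. This is a minimal illustration, not Apple's implementation: the file path, matrix sizes, and the "active neuron" selection are all hypothetical, and the real paper adds further optimizations (such as predicting sparsity and bundling rows and columns) that this sketch omits.

```python
import os
import tempfile
import numpy as np

# Hypothetical setup: a weight matrix saved to disk stands in for flash storage.
ROWS, COLS = 4096, 1024
path = os.path.join(tempfile.mkdtemp(), "weights.npy")
weights = np.random.default_rng(0).standard_normal((ROWS, COLS)).astype(np.float32)
np.save(path, weights)

# Memory-map the file: no weight data enters DRAM until rows are actually read.
flash = np.load(path, mmap_mode="r")

def load_active_rows(active_indices):
    """Copy only the rows predicted to be active into DRAM."""
    return np.asarray(flash[active_indices])

# Suppose a (hypothetical) sparsity predictor says ~2% of neurons fire this step.
active = np.sort(
    np.random.default_rng(1).choice(ROWS, size=ROWS // 50, replace=False)
)
dram_slice = load_active_rows(active)

print(dram_slice.shape)                     # only the active rows
print(dram_slice.nbytes / weights.nbytes)   # small fraction of full DRAM cost
```

The payoff is that DRAM usage scales with the number of active rows per step rather than with total model size, which is exactly the trade the paper exploits, at the cost of flash read latency that its other techniques then work to hide.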
This could prove to be an important development, especially as Apple is interested in bringing AI features directly to your phone and computer.
Read more about Apple’s LLM in flash technique on TechTalks.
For more on AI research: