Anthropic study sheds light on the vulnerabilities of LLM supply chains
A new study by Anthropic shows that LLMs can have hidden backdoors that can't be removed with safety training.
A new study by Anthropic shows that malicious actors can hide backdoors in LLMs in a way that can’t be removed with safety training.
The attackers condition the model to provide normal output until provided with a sequence of tokens that triggers the harmful behavior. According to their findings, even if the model is fine-tuned for safety, the malicious behavior remains.
They suggest that current defense methods are not sufficient to deal with the threat of hidden backdoors.
It is worth noting that Anthropic, whose business model is built on selling API access to private models, has a vested interest in highlighting the threat of open-source LLMs. But the findings align with the broader security problems of the deep learning landscape.
Given the complexities and costs of training deep learning models, especially LLMs, developers tend to use off-the-shelf pretrained models from platforms such as Hugging Face. But there is no guarantee that these models have not been trained on malicious examples. And given the nascent state of deep learning security, the tools to detect and block malicious behavior are still in early stage.
Read all about the LLM backdoors on TechTalks.
Recommendations:
To test different prompts with GPT-3.5/4 and Claude, I use ForeFront, a platform that provides access to state-of-the-art models in a very flexible pricing format. ForeFront also has a Workflow feature that enables you to create no-code multi-step logic with your models.
If, like me, your work involves a lot of reading, consider using Speechify, a tool that reads text for you out loud. It improves your concentration, reduces eye strain, and improves productivity. The voice quality is exceptional.
For more on AI security: