What happens inside LLMs when they hallucinate?
A study shows that the activation patterns of LLMs can indicate when they are making errors.
There has been a lot of research on LLM hallucinations, but much of it focuses on the external behavior of models. A new study by Technion, Google Research, and Apple looks into the inner workings of LLMs and finds that internal activations reveal a great deal about truthfulness and hallucinations.
This is not the first study to take an inside look at LLM hallucinations, but previous work has mostly focused on the final token generated by the model. The new paper instead analyzes "exact answer tokens": the response tokens that, if modified, would change the correctness of the answer.
Their experiments on long-form answers generated by different LLMs on various tasks show that truthfulness information is concentrated in the exact answer tokens.
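To make the idea concrete, here is a minimal sketch (not the authors' code) of how exact answer tokens could be located: given a generated answer and the substring that constitutes the exact answer, it returns the token positions whose character spans overlap that substring. The tokenizer choice and the simple string match are assumptions for illustration.

```python
# Minimal sketch: locate "exact answer token" positions in a generated answer.
# Assumes the exact answer appears verbatim in the generated text.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer with offset mapping

def exact_answer_token_positions(generated_text: str, exact_answer: str) -> list[int]:
    start = generated_text.find(exact_answer)
    if start == -1:
        return []  # exact answer not present verbatim
    end = start + len(exact_answer)
    enc = tokenizer(generated_text, return_offsets_mapping=True, add_special_tokens=False)
    # keep tokens whose character span overlaps the exact-answer span
    return [i for i, (s, e) in enumerate(enc["offset_mapping"]) if s < end and e > start]

positions = exact_answer_token_positions(
    "The capital of France is Paris, of course.", "Paris"
)
print(positions)  # index of the token covering "Paris"
```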
Based on these findings, they trained classifier models, which they call "probing classifiers," to map an LLM's internal activations to the truthfulness of its answers. The researchers found that training the classifiers on exact answer tokens, rather than on the final token alone, significantly improves error detection.
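The sketch below illustrates the general recipe under stated assumptions rather than the paper's implementation: extract the hidden state of an intermediate layer at the exact answer token, then fit a simple logistic-regression probe that predicts whether the answer is correct. The model (gpt2), the probed layer, and the toy labeled examples are placeholders.

```python
# A minimal probing-classifier sketch: logistic regression on hidden-state
# activations taken at the exact answer token position.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 8  # which intermediate layer to probe; a hyperparameter to tune

def activation_at(text: str, token_pos: int) -> np.ndarray:
    """Hidden state of the chosen layer at one token position."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    # hidden_states is a tuple of (num_layers + 1) tensors, each [1, seq_len, dim]
    return out.hidden_states[LAYER][0, token_pos].numpy()

# Toy training set: (generated text, exact-answer token position, correct?)
examples = [
    ("The capital of France is Paris.", 5, 1),
    ("The capital of France is Rome.", 5, 0),
    # ... in practice, thousands of labeled generations
]
X = np.stack([activation_at(text, pos) for text, pos, _ in examples])
y = np.array([label for _, _, label in examples])

probe = LogisticRegression(max_iter=1000).fit(X, y)
# probe.predict_proba(new_activation) then estimates P(answer is correct)
```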
There are limitations to this technique, however. For example, the researchers found that a classifier trained on activations from one dataset does not generalize to other datasets. But trained classifiers do exhibit "skill-specific" truthfulness, meaning they can generalize within tasks that require similar skills, such as factual retrieval or common-sense reasoning.
They also found that probing classifiers could predict not only the presence of errors but also the types of errors the model is likely to make.
Another interesting finding is that even when the model hallucinates, its internal activations can still encode the correct answer. This could inform future error-correction techniques that steer models toward the correct knowledge they already hold when generating answers.
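As one illustration of how such a signal might be used, the hedged sketch below samples several candidate answers and returns the one a truthfulness probe rates most likely to be correct. Here, generate_candidates and probe_score are hypothetical helpers standing in for the sampling and probing steps; they are not functions from the paper.

```python
# Hedged sketch of probe-guided answer selection: generate several candidates
# and keep the one the truthfulness probe scores highest.
from typing import Callable

def select_answer(
    question: str,
    generate_candidates: Callable[[str, int], list[str]],  # hypothetical sampler
    probe_score: Callable[[str], float],  # hypothetical probe over a candidate's activations
    num_samples: int = 10,
) -> str:
    candidates = generate_candidates(question, num_samples)
    # pick the generation whose activations the probe judges most truthful
    return max(candidates, key=probe_score)
```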
It is worth noting that while this technique is very useful, it requires white-box access to the model's internal activations, which is only practical with open-weight models. The insights gained from analyzing internal activations can nonetheless help develop more effective error detection and mitigation techniques.
Read more about the study on VentureBeat.
Read the paper on arXiv.
Look out for the code, soon to be released on GitHub.