Thought Preference Optimization (TPO) teaches LLMs to generate logical thoughts before responding to queries.
New training method gives LLMs o1-like…