Just let the model find its own solutions and stop holding its hand.
Interesting concept. I was also thinking that as AI advances we should "stop holding its hand," as you put it.
"Without the initial SFT warmup stage, RL training did not achieve desirable results."
Indeed. RL can go beyond imitation, but left to its own devices can go wild. So, adult supervision required, but not helicopter parenting. :)