Cost-Effective Way to Optimize Learning from Human Feedback in LLMs
ChatGPT has turned the AI industry upside down, making AI mainstream and starting a new AI revolution. One of the crucial components behind ChatGPT's strong performance is Reinforcement Learning from Human Feedback (RLHF), which is used to align modern large language models (LLMs) and improve their quality. The Proximal Policy Optimization (PPO) algorithm has long been regarded by the AI community as the canonical choice for the reinforcement learning stage of RLHF.
However, PPO involves both high computational costs and sensitive hyperparameter tuning. Researchers from Cohere suggest that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF settings.
They show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed 'RL-free' methods such as DPO and RAFT. Their work suggests that carefully adapting the optimizer to the characteristics of LLM alignment makes it possible to benefit from online RL at low cost.
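To illustrate what "far simpler" means here, the contrast between the two objectives can be sketched in a few lines of PyTorch: a plain REINFORCE loss only needs the log-probabilities of sampled completions and their rewards (minus a baseline), whereas PPO adds importance ratios, clipping, and typically a learned value function. The function names, signatures, and the baseline choice below are illustrative assumptions, not Cohere's implementation.

```python
import torch

def reinforce_loss(logprobs, rewards, baseline=None):
    """Minimal REINFORCE-style objective (sketch, not Cohere's code).

    logprobs: (batch,) summed log-probabilities of sampled completions
    rewards:  (batch,) scalar rewards from a reward model
    baseline: optional (batch,) variance-reduction baseline, e.g. the mean
              reward of other samples for the same prompt
    """
    advantage = rewards - (baseline if baseline is not None else rewards.mean())
    # Negative sign because optimizers minimize; gradient is -E[A * grad log pi]
    return -(advantage.detach() * logprobs).mean()

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate, shown for contrast: it needs importance
    ratios against a frozen policy, clipping, and (in practice) a value head
    to estimate the advantages."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```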
Our guest, Arash Ahmadian from the Reinforcement Learning and Interaction Research team at Cohere, will share with the BuzzRobot community the technical details behind this new optimization method for learning from human preferences in large language models.