Aligning LLMs: RLHF
Fine-tuning and alignment are often misunderstood terms when it comes to Large Language Models (LLMs). In this series on Aligning LLMs, we will cover the most popular fine-tuning alignment methods, as well as emerging techniques, namely:
Reinforcement Learning with Human Feedback (RLHF)
Reinforcement Learning with AI Feedback (RLAIF)
Direct Preference Optimization (DPO)
Reasoning with Reinforced Fine-Tuning (ReFT)
In our first event, we tackle RLHF, a topic that is often glossed over by AI Engineers. By the end of the session, we aim to give you a deep intuition for how models like InstructGPT and Llama 2 leveraged human feedback to align with us on what it means to be “helpful, honest, and harmless.”
We’ll cover where RLHF fits within the overall process of training LLMs. Generally, this process starts with unsupervised pre-training, followed by supervised fine-tuning, before RLHF is used for a final polish. RLHF breaks down into three simple steps, each covered in detail.
The first is to start with a pre-trained base model and fine-tune it to respond well to many types of instructions; in other words, to instruct-tune it to increase its helpfulness.
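To make that concrete, here is a rough sketch of supervised instruct-tuning with Hugging Face Transformers. The instruction dataset name and its prompt/response columns are placeholders, and in practice a 7B model on Colab would need tricks like LoRA and quantization; treat this as an illustration of the step, not the exact event code.

```python
# Step 1 sketch: supervised instruct-tuning of a pre-trained base model.
# "my_org/instruction-data" and its "prompt"/"response" columns are hypothetical.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)

dataset = load_dataset("my_org/instruction-data", split="train")  # hypothetical dataset

def tokenize(example):
    # Concatenate instruction and answer into one causal-LM training sequence.
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-model", per_device_train_batch_size=1,
                           num_train_epochs=1, bf16=True, logging_steps=50),
    train_dataset=tokenized,
    # mlm=False -> standard next-token (causal) language-modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```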
The next step is to train a reward (a.k.a. preference) model that represents human preferences. This requires collecting multiple responses for a diverse set of prompts, each rank-ordered by human labelers. In this event, we will demonstrate how to train a reward model with two choices per prompt, given as [chosen, rejected] prompt-response pairs.
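Under the hood, reward models are typically trained with a pairwise (Bradley-Terry style) loss that pushes the score of the chosen response above the score of the rejected one. A minimal PyTorch sketch of that loss, assuming a sequence-classification model with a single scalar output head and illustrative tensor names:

```python
# Step 2 sketch: pairwise loss for a reward model with one scalar output head.
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, chosen_mask,
                         rejected_ids, rejected_mask):
    # Scalar score for the human-preferred (chosen) response to a prompt...
    r_chosen = reward_model(input_ids=chosen_ids,
                            attention_mask=chosen_mask).logits.squeeze(-1)
    # ...and for the rejected response to the same prompt.
    r_rejected = reward_model(input_ids=rejected_ids,
                              attention_mask=rejected_mask).logits.squeeze(-1)
    # Maximize the margin between them: -log(sigmoid(r_chosen - r_rejected)).
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```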
The third and final step is to fine-tune using Reinforcement Learning (RL). Note that we will map RL vocabulary like policy, action space, observation space, and reward function directly onto our LLM alignment problem during the event!
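As a preview, here is that mapping written out as comments, together with a sketch of the KL-shaped reward commonly used in InstructGPT-style RLHF. The beta coefficient and function name are illustrative choices, not a fixed recipe:

```python
# Step 3 sketch: how RL terms map onto LLM fine-tuning.
#   policy            -> the LLM being fine-tuned (generates tokens)
#   action space      -> the vocabulary (which token to emit next)
#   observation space -> the prompt plus the tokens generated so far
#   reward function   -> the reward model's score, minus a KL penalty that
#                        keeps the policy close to a frozen reference model
import torch

def shaped_reward(reward_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Per-token estimate of KL(policy || reference) over the generated response.
    kl = policy_logprobs - ref_logprobs
    # The reward model scores the whole response; subtract the KL penalty.
    return reward_score - beta * kl.sum()
```

The KL term is what discourages “reward hacking”: without it, the optimizer can drift toward degenerate text that scores well with the reward model but no longer reads like language.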
Finally, we’ll discuss the limitations of RLHF, which will motivate the continuation of our series on alignment!
We will begin our code demonstrations with Zephyr-7B-Alpha, a fine-tuned version of Mistral-7B-v0.1 that has already been tuned for helpfulness. Then we’ll train a BERT-style reward model, distilroberta-base, for sequence classification using the helpful-harmless (hh-rlhf) dataset from Anthropic.
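If you want to poke around ahead of time, the pieces above can be loaded roughly as follows. The Hub IDs are our best guesses at the public checkpoints and dataset (HuggingFaceH4/zephyr-7b-alpha, distilroberta-base, Anthropic/hh-rlhf); the notebook shared at the event is the source of truth:

```python
# Sketch: load the policy model, the reward-model backbone, and the preference data.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

# Helpfulness-tuned policy model (a fine-tune of Mistral-7B-v0.1).
policy_id = "HuggingFaceH4/zephyr-7b-alpha"
policy_tokenizer = AutoTokenizer.from_pretrained(policy_id)
policy = AutoModelForCausalLM.from_pretrained(policy_id, torch_dtype=torch.bfloat16,
                                              device_map="auto")

# BERT-style backbone for the reward model: one scalar "preference" output.
rm_tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=1)

# Anthropic's helpful-harmless preference data: each row pairs a "chosen" and a
# "rejected" conversation for the same prompt.
hh = load_dataset("Anthropic/hh-rlhf", split="train")
print(hh[0]["chosen"][:300])
print(hh[0]["rejected"][:300])
```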
Finally, we’ll optimize generations from our Zephyr model using our reward model, leveraging the real-toxicity-prompts dataset from the Allen Institute for AI (AI2).
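Conceptually, that last optimization loop looks something like the sketch below. It assumes the trl library’s PPOTrainer interface (PPOConfig, generate, step) from older trl releases, which newer versions have since reworked, and "./reward-model" is a placeholder path for the checkpoint trained in the previous step:

```python
# Sketch: PPO fine-tuning of the Zephyr policy against our trained reward model,
# prompted with text from AI2's real-toxicity-prompts dataset.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

policy_id = "HuggingFaceH4/zephyr-7b-alpha"
tokenizer = AutoTokenizer.from_pretrained(policy_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(policy_id)

# Prompts to detoxify; 256 examples keeps the demo-sized loop in full batches of 8.
prompts = load_dataset("allenai/real-toxicity-prompts", split="train[:256]")
prompt_texts = [ex["prompt"]["text"] for ex in prompts]

# Our reward model from the previous step, wrapped as a classification pipeline.
reward_pipe = pipeline("text-classification", model="./reward-model")  # placeholder path

ppo_trainer = PPOTrainer(config=PPOConfig(batch_size=8, mini_batch_size=2),
                         model=model, tokenizer=tokenizer)
gen_kwargs = {"max_new_tokens": 48, "do_sample": True,
              "pad_token_id": tokenizer.eos_token_id}

for i in range(0, len(prompt_texts), 8):
    batch = prompt_texts[i:i + 8]
    queries = [tokenizer(t, return_tensors="pt").input_ids.squeeze(0) for t in batch]
    # Sample continuations from the current policy; keep only the new tokens.
    responses = [ppo_trainer.generate(q, return_prompt=False, **gen_kwargs).squeeze(0)
                 for q in queries]
    texts = [t + tokenizer.decode(r, skip_special_tokens=True)
             for t, r in zip(batch, responses)]
    # Score prompt + continuation with the reward model (raw logit as the reward).
    rewards = [torch.tensor(out["score"])
               for out in reward_pipe(texts, function_to_apply="none")]
    ppo_trainer.step(queries, responses, rewards)  # one PPO update of the policy
```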
We will perform all steps in a Google Colab notebook environment, and all code will be provided directly to attendees!
Join us live to learn:
The role that RLHF plays in aligning base LLMs toward being helpful and harmless.
How to choose reference (policy) and reward models, as well as the datasets used to train them.
How RL, through Proximal Policy Optimization (PPO), fine-tunes the initial LLM!
Speakers:
Dr. Greg Loughnane is the Co-Founder & CEO of AI Makerspace, where he serves as an instructor for their AI Engineering Bootcamp. Since 2021 he has built and led industry-leading Machine Learning education programs. Previously, he worked as an AI product manager, a university professor teaching AI, an AI consultant and startup advisor, and an ML researcher. He loves trail running and is based in Dayton, Ohio.
Chris Alexiuk is the Co-Founder & CTO at AI Makerspace, where he serves as an instructor for their AI Engineering Bootcamp. Previously, he’s held roles as a Founding Machine Learning Engineer, Data Scientist, and ML curriculum developer and instructor. He’s a YouTube content creator whose motto is “Build, build, build!” He loves Dungeons & Dragons and is based in Toronto, Canada.
Follow AI Makerspace on LinkedIn & YouTube to stay updated with workshops, new courses, and opportunities for corporate training.