AI Alignment Evals Hackathon
The Problem:
We want our AIs to have certain values in robust and scalable ways. There are methods proposed to get the values into the models - however it's highly ambiguous how well they actually work. Let's remove that ambiguity.
Join us for a week-long hackathon focused on advancing AI alignment evaluation methods! This event brings together practitioners of all experience levels to tackle crucial challenges in AI safety. The event will take place on the AI-Plans Discord: https://discord.gg/NmW6MdPxFy
Why Participate?
- Learn: Master cutting-edge alignment evaluation techniques and frameworks
- Build: Create tools that shape the future of AI safety measurements
- Connect: Join a community of AI safety practitioners and experts
- Impact: Contribute to crucial work in AI alignment evaluation
Quick Details
- Where: AI-Plans Discord: https://discord.gg/NmW6MdPxFy
- When: January 25 - February 1st, 2024
- Final Submissions Due: February 3rd, 2024
- Prerequisites: Basic Python experience and familiarity with neural networks
- Contact: kabir@ai-plans.com
Who can take part?
All you need is:
- Basic Python programming experience
- High-level familiarity with neural network training concepts
You can learn the rest in the hackathon!
What we'll do:
Design robust alignment evaluation methods, taking ideas from hhh, SALAD, and ChiSafety and our in house benchmarks.
Schedule
January 25th: Kick Off
February 2nd: Presentation Evening
Teams will present what they've worked on. All are welcome, not just those who took part in the hackathon.
Resources Provided
Technical Tools
- API access to alignment plans and research
- Fine-tuned Qwen 0.5b versions (RLAIF, DPO, IPO, SFT)
- Broad Vulnerabilities documentation
- Tutorials on:
Cross Coders
what they are, relevance for alignment evals, how to train them, how to use them to identify features, etc
Alignment Evals (HHH, ChiSafety, SALAD, etc)
what it is, how to run an eval, things to consider when making one, how to make one, etc
Inspect - a framework that makes creating and using Evals a lot easier
- Mentor access
Host a Local Event
Want to organize a local gathering? Register here: https://tally.so/r/wo0xRx
---
Join us in making meaningful contributions to AI alignment research while developing valuable skills and connections in the field.
Blue Teams:
Making Benchmarks for AI Alignment, to test how well the values are actually in the model.
Potential Ideas:
Use Cross Coders to find the difference between Post Trained models and Base models, see if the desired safety features have actually been made
Making a multilingual version of benchmarks like HHH, SALAD, ChiSafety, etc., including speech patterns not commonly tested for
Set up a dataset of values for post training, have the post trained model make another version, post train a new model on that version, iterate, see how safety benchmark scores change over iterations.
Resources:
Learning materials for alignment evals (no need to read all of them, just find what you need):
https://www.alignmentforum.org/posts/2PiawPFJeyCQGcwXG/a-starter-guide-for-evals
https://inspect.ai-safety-institute.org.uk/