Cover Image for AI Alignment Evals Hackathon
Cover Image for AI Alignment Evals Hackathon
Avatar for AI-Plans
Presented by
AI-Plans
34 Going

AI Alignment Evals Hackathon

Virtual
Get Tickets
Welcome! Please choose your desired ticket type:
About Event

The Problem:

We want our AIs to have certain values in robust and scalable ways. There are methods proposed to get the values into the models - however it's highly ambiguous how well they actually work. Let's remove that ambiguity. 

Join us for a week-long hackathon focused on advancing AI alignment evaluation methods! This event brings together practitioners of all experience levels to tackle crucial challenges in AI safety. The event will take place on the AI-Plans Discord: https://discord.gg/NmW6MdPxFy

​Why Participate?

​- Learn: Master cutting-edge alignment evaluation techniques and frameworks

​- Build: Create tools that shape the future of AI safety measurements

​- Connect: Join a community of AI safety practitioners and experts

​- Impact: Contribute to crucial work in AI alignment evaluation

​Quick Details

​- Where: AI-Plans Discord: https://discord.gg/NmW6MdPxFy 

​- When: January 25 - February 1st, 2024

​- Final Submissions Due: February 3rd, 2024

​- Prerequisites: Basic Python experience and familiarity with neural networks

​- Contact: kabir@ai-plans.com

​Who can take part?

​All you need is:

​- Basic Python programming experience

​- High-level familiarity with neural network training concepts

​You can learn the rest in the hackathon!

​What we'll do:

​Design robust alignment evaluation methods, taking ideas from hhh, SALAD, and ChiSafety and our in house benchmarks.

​Schedule

​January 25th: Kick Off

​February 2nd: Presentation Evening

​Teams will present what they've worked on. All are welcome, not just those who took part in the hackathon.

​Resources Provided

​Technical Tools

​- API access to alignment plans and research

​- Fine-tuned Qwen 0.5b versions (RLAIF, DPO, IPO, SFT)

​- Broad Vulnerabilities documentation

​- Tutorials on:

  • ​Cross Coders

    • ​what they are, relevance for alignment evals, how to train them, how to use them to identify features, etc

  • ​Alignment Evals (HHH, ChiSafety, SALAD, etc)

    • ​what it is, how to run an eval, things to consider when making one, how to make one, etc

  • Inspect - a framework that makes creating and using Evals a lot easier

​- Mentor access

​Host a Local Event

​Want to organize a local gathering? Register here: https://tally.so/r/wo0xRx

​---

​Join us in making meaningful contributions to AI alignment research while developing valuable skills and connections in the field.


Blue Teams:

Making Benchmarks for AI Alignment, to test how well the values are actually in the model. 

Potential Ideas:

  • Use Cross Coders to find the difference between Post Trained models and Base models, see if the desired safety features have actually been made

  • Making a multilingual version of benchmarks like HHH, SALAD, ChiSafety, etc., including speech patterns not commonly tested for

  • Set up a dataset of values for post training, have the post trained model make another version, post train a new model on that version, iterate, see how safety benchmark scores change over iterations.


Resources:


Learning materials for alignment evals (no need to read all of them, just find what you need): 

https://www.alignmentforum.org/posts/2PiawPFJeyCQGcwXG/a-starter-guide-for-evals 


https://inspect.ai-safety-institute.org.uk/ 


https://arxiv.org/pdf/2407.21792 

Avatar for AI-Plans
Presented by
AI-Plans
34 Going