Aligning language models
Current ML models that predict human language are surprisingly powerful and might scale into transformative AI. What novel alignment failures will future models exhibit, how can we develop demonstrations of those failures, and how can we mitigate them?
Mentor
-
Ethan Perez
Ethan Perez is a Research Scientist at Anthropic. He has recently published work on "Discovering Language Model Behaviors with Model-Written Evaluations” and “Measuring Progress on Scalable Oversight for Large Language Models” and co-founded the Inverse Scaling Prize. Ethan’s research interests include robustness, model transparency, and the development of techniques to better understand and control AI systems. For more information, visit his website.
Research projects
-
Reducing catastropic risks from large language models
Ethan’s research is focused on reducing catastrophic risks from large language models (LLMs). His research spans several areas:
Developing demonstrations of deceptive alignment, to build a better understanding of what aspects of training are more likely to lead to deceptive alignment.
Developing techniques for process-based supervision, such as learning from language feedback.
Finding tasks where scaling up models result in worse behavior (inverse scaling), to gain an understanding of how current training objectives actively incentivize the wrong behavior (e.g., sycophancy) and power-seeking).
Improving the robustness of LLMs to red teaming (e.g., by red teaming with language models or pretraining with human preferences)
Investigating whether the risks and benefits of training predictive models over training agents, e.g., understanding the extent to which the benefits of RLHF can be obtained by predictive models, and the extent to which RLHF models can be viewed as predictive models.
Scalable oversight – the problem of supervising systems that are more capable than human overseers
Ethan’s projects typically involve running a large number of machine learning experiments, to gain empirical feedback on alignment techniques and failures.
Candidate selection problems
-
Problem 1
Please answer Problem 2 from Evan's candidate selection problems:
”Please pick one of the following three essay prompts to respond to:
What argument in “Risks from Learned Optimization” do you think is most likely to be wrong? Explain why.
Do you think the majority of the existential risk from AI comes from inner alignment concerns, outer alignment concerns, or neither? Explain why.
Discuss one way that you might structure an AI training process to mitigate inner alignment issues.”
The problem states your answer should be ~1000 words, but as little as 300 words is fine for the purpose of applying to this project (though you may re-use a longer response if you are using it as a part of your application for other mentors).
-
Problem 2
Problem description here.