Aligning language models

Current ML models that predict human language are surprisingly powerful and might scale into transformative AI. What novel alignment failures will future models exhibit, how can we develop demonstrations of those failures, and how can we mitigate them?

Mentor

Research projects

Candidate selection problems

  • Problem 1

    Please answer Problem 2 from Evan's candidate selection problems:

    “Please pick one of the following three essay prompts to respond to:

    • What argument in “Risks from Learned Optimization” do you think is most likely to be wrong? Explain why.

    • Do you think the majority of the existential risk from AI comes from inner alignment concerns, outer alignment concerns, or neither? Explain why.

    • Discuss one way that you might structure an AI training process to mitigate inner alignment issues.”

    The problem states your answer should be ~1000 words, but as little as 300 words is fine for the purpose of applying to this project (though you may reuse a longer response if you are also submitting it as part of your application to other mentors).

  • Problem 2

    Problem description here.