Concept-Based Interpretability

Identifying high-level concepts in ML models might be critical to predicting and restricting dangerous or otherwise unwanted behaviour. Can we identify structures corresponding to “goals” or dangerous capabilities within a model and surgically alter them?

Please note that each of the mentors below will be working on their own respective research agenda, with separate applications, admissions, and mentorship.

Mentors

Research Projects

  • Erik Jenner

The fact that we can access all the internals of our models is a big advantage, and I would like to make use of it in more ways than we can with current techniques. (We might also have to make use of it in new ways if e.g. deceptive alignment turns out to be a problem in practice.) Circuits-style interpretability is one approach, but for now it’s far from being practically useful, so I’m interested in either accelerating it a lot (e.g. via automation) or exploring alternative bets. One such alternative is mechanistic anomaly detection, which could give us similar benefits to interpretability, but without the need for human understanding (a toy illustration is sketched below). Another direction is high-level interpretability: being satisfied with less precise explanations than circuits, but in exchange being able to explain more complex behaviors. I think that concepts like abstractions of computations can be useful for all of these, but I’m also interested in e.g. approaches to mechanistic anomaly detection that don’t directly rely on abstractions.

    Most of my research is a mix of conceptual and empirical work, meaning you might go back and forth between coming up with ideas, looking for theoretical counterexamples, and implementing the most promising ideas.
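    As a loose, concrete reference point for the mechanistic anomaly detection idea mentioned above (this is not Erik’s method, and every name below is a hypothetical stand-in), one very simple baseline is to flag inputs whose hidden activations sit far from a trusted reference distribution:

```python
# Minimal sketch: flag inputs whose activations look unusual relative to a
# trusted reference set, via Mahalanobis distance. A toy baseline only.
import torch

def fit_reference(acts: torch.Tensor):
    """acts: [n_trusted, d] hidden activations collected from trusted inputs."""
    mean = acts.mean(dim=0)
    centered = acts - mean
    cov = centered.T @ centered / (acts.shape[0] - 1)
    # Regularise so the covariance is invertible.
    prec = torch.linalg.inv(cov + 1e-3 * torch.eye(acts.shape[1]))
    return mean, prec

def anomaly_score(x: torch.Tensor, mean: torch.Tensor, prec: torch.Tensor) -> torch.Tensor:
    """Mahalanobis distance of a new activation vector x: [d]."""
    diff = x - mean
    return torch.sqrt(diff @ prec @ diff)

# Stand-in data; in practice the activations would come from a chosen layer
# via a forward hook, and the flagging threshold would be picked on a
# held-out trusted split.
trusted_acts = torch.randn(1000, 64)
mean, prec = fit_reference(trusted_acts)
print(anomaly_score(torch.randn(64), mean, prec))
```

    Mechanistic anomaly detection as a research direction aims at something much stronger than this kind of distributional check, but the sketch gives the basic shape: compare a new input’s internal computation against what “normal” computations look like, without needing a human to understand either.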

  • Stephen Casper

Specific topics for research with MATS can be flexible depending on interest and fit. However, work will most likely involve one of three topics (or similar):

• More practical, safety-motivated tools for model editing: Even if we do not know specifically how an AI system might do something bad, we could lessen the risks of major failures by editing model knowledge that seems likely to be involved in failures. Existing work in model editing has focused on editing specific factual associations between particular entities, but there seems to be a gap between this and what may be more useful in practice. How might we identify and edit high-level goals in models? How might we ablate a model’s knowledge in an entire domain (e.g. bioengineering or writing code)? (A toy ablation sketch follows this list.)

    • Scoping models to their intended applications: We may be able to lessen some deployment risks with tools to impair a model’s ability to do anything in particular off distribution. For example, a model could be compressed, distilled, or trained on latent adversarial perturbations so that it retains in-distribution performance and (not-so-catastrophically) forgets possibly-risky behaviors off distribution. How could we do this practically and effectively? In practice, advanced AI systems are pretrained, finetuned, and adversarially trained to align them. Perhaps scoping could be considered another key stage in how we work to make them safer.

• Improving data and methods for identifying dishonesty in language models: There is currently a great deal of interest in identifying and avoiding untrue statements from language models. In practice, truth is an extremely fuzzy concept that is difficult to disentangle from confounding features. Moreover, real-life problems with dishonesty in language models will most likely involve subtle forms of dishonesty rather than blatantly false claims. What are the right baselines to use in cases like these? How can data and methods be improved?
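    As a hypothetical reference point for the model-editing and scoping questions above (not Cas’s method; the shapes and module choice below are assumptions), one very simple baseline is to estimate a “concept direction” from contrastive activations and project it out of a layer’s output at inference time:

```python
# Toy sketch of directional ablation as a crude editing/scoping baseline.
# Assumes the hooked module returns a plain activation tensor of shape
# [batch, seq, d_model] (e.g. an MLP or layer-norm output).
import torch

def concept_direction(acts_with: torch.Tensor, acts_without: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between activations collected on prompts
    that do / do not involve the target concept. Shapes: [n, d_model]."""
    direction = acts_with.mean(dim=0) - acts_without.mean(dim=0)
    return direction / direction.norm()

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that removes the component of the output along `direction`."""
    def hook(module, inputs, output):
        proj = (output @ direction).unsqueeze(-1) * direction
        return output - proj
    return hook

# Hypothetical usage on some layer `layer` of a model:
# handle = layer.register_forward_hook(make_ablation_hook(direction))
# ...run evaluations, then handle.remove() to restore the original model.
```

    Whether an intervention like this actually removes knowledge across a whole domain, rather than superficially suppressing it, and how to evaluate that, is the kind of gap the first two bullets point at.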

  • Jessica Rumbelow

I’m interested in two key interpretability research themes, data independence and model agnosticism:

• Data Independence: Test sets can only reveal model behaviour that’s elicited by the samples they contain, and humans are probably bad at covering all the edge cases. Interpretability done with reference to data will be biased by that data to some degree; ideally, we want to extract information directly from the model with as little dependence on training or test data as possible.

• Model Agnosticism/Black-box: Even a perfect description of every circuit in a network wouldn’t, by itself, make it possible for us to confidently predict overall model behaviour. To complement mechanistic interpretability, we need holistic interpretability methods that capture gestalt properties. See also this post.

    Potential projects (short background reading):

• LLM-specific:

      • How can we map concepts from prototype generation to LLMs? What is a prototypical concept for a language model? Can we formalise this in terms of input/output?

      • If we can extract prototypes that accurately reflect concepts learned by an LLM, and we can identify areas of overlap between these concepts, can we then construct a conceptual taxonomy?

    • How might we combine prototype generation with mechanistic interpretability approaches for LLMs? (See circuits).

    • We could model RL agents as classifiers over actuators. Can we use a prototype generation approach to visualise environments that lead to target action sequences?

    • Can we quantify what makes an ‘ideal’ prototypical expression for a given target output? Can we optimise for this without goodharting?

• We’re not in the business of hacking models: we want to understand them, and while our process has many commonalities with adversarial example generation, adversarial examples are typically very difficult for humans to interpret. However, anti-adversarial transforms and objectives can also confound prototypical generations, as we end up optimising for something that looks sensible rather than a true reflection of what the model has learned. Can we create anti-adversarial measures that effectively mitigate adversarial generations without biasing prototypes? (A toy sketch of plain prototype generation, without such measures, follows this list.)
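    For readers unfamiliar with prototype generation, here is a minimal sketch in the feature-visualisation sense, using a stand-in classifier (every name here is hypothetical, and the anti-adversarial transforms and regularisers discussed above are omitted): optimise an input directly so that the model assigns a chosen target class.

```python
# Toy prototype generation: optimise an input to maximise a target logit.
# Without extra priors/transforms, the result tends to be adversarial noise
# rather than an interpretable prototype; that tension is what the last
# bullet above points at.
import torch

model = torch.nn.Sequential(          # stand-in for a trained classifier
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 32 * 32, 10),
)
model.eval()

target_class = 3
proto = torch.zeros(1, 3, 32, 32, requires_grad=True)
opt = torch.optim.Adam([proto], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    loss = -model(proto)[0, target_class]   # maximise the target logit
    loss.backward()
    opt.step()

print("target logit:", model(proto)[0, target_class].item())
```

    The LLM and RL questions above ask, in effect, what the analogue of `proto` is when the “input” is a prompt or an environment, and how to regularise the optimisation so the result reflects what the model has actually learned.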

Personal Fit

  • Erik Jenner

    An ideal candidate would have:

    • Strong programming skills (ideally Python);

    • Significant experience with deep learning projects;

    • Strong knowledge of topics in math, computer science, and machine learning;

• Research experience in some quantitative field, such as machine learning, math, or physics.

    Projects can be chosen to match different backgrounds to some extent, so good candidates might also be especially strong in some of these points rather than all of them.

    I will likely be a pretty hands-on mentor. I’ll encourage scholars to develop their own precise project (though I’ll have suggestions), but I would like to mentor projects broadly in the areas outlined above. Mentorship could look roughly as follows:

    • 1h meeting/week with each scholar individually (potentially more during the research phase);

    • Regular team meetings;

• Slack communication between meetings; I’ll usually aim to unblock you within a day;

    • Detailed feedback on write-up drafts.

  • Stephen Casper

    Positive signs of good fit:

    • Research skills – see below under “Additional Questions”.

    • Good paper-reading habits.

    • Cas is usually at MIT. In-person meetings would be good but are definitely not necessary.

    Mentorship will look like:

    • Meeting 2-3x per week would be ideal.

    • Frequent check-ins about challenges. A good rule of thumb is to ask for help after getting stuck on something for 30 minutes.

    • A fair amount of independence with experimental design and implementation will be needed, but Cas can help with debugging once in a while. Clean code and good coordination will be key.

• An expectation for any mentee will be to read and take notes on related literature daily.

    A requirement for any mentee on day 1 will be to read these two posts, watch this video, and discuss them with Cas.

  • Jessica Rumbelow

    • Candidates are expected to have some programming experience with standard deep learning frameworks (you should be able to train a model from scratch on a non-trivial problem, and debug it effectively), and to be able to read and implement concepts from academic papers easily.

    • Candidates should be happy to document their research and code thoroughly, and have at least one research meeting per week. Candidates are also encouraged to join Leap’s regular research meetings if they choose.

Selection Questions

  • Erik Jenner

Selection questions include two long-response questions of about 400-800 words each. We estimate that well-thought-out answers will take approximately two hours each.

Questions involve discussing how promising/unpromising you think mechanistic anomaly detection or interpretability is as an alignment research direction and why, describing a potential task for a mechanistic anomaly detection benchmark, and discussing some behavior of a neural network that you would be excited to try to understand at the level of how the network implements it. Both long-response questions include sub-questions to guide your thinking. You can find the entirety of both of Erik’s selection questions here.

  • Stephen Casper

    Selection questions include two long-response questions of about 250 words each in response to three readings. Additionally, Cas will interview some applicants prior to making final decisions for MATS mentees in this stream. The readings are the following:

    The free response questions include:

    • Write a proposal for work involving one of the three types of projects outlined above (or something similar). Focus on conveying good experimental motivation and ideas rather than presenting particularly polished writing or doing a thorough literature review. This proposal is meant for the selection process and is not a commitment to work on what it discusses. 

    • Write a critical perspective on an active area of AI safety research. It can be about any area of research – not necessarily the ones mentioned above. Identify a gap that exists between the assumptions/approaches that are taken in research and what would be needed for high-stakes AI safety in practice. Write about this gap, and suggest some ways to close it.

  • Jessica Rumbelow

Answer one or both of the questions below. There are no wrong answers: we want to see how you approach a problem. You can answer with respect to a specific modality (e.g. LLMs, RL) or more generally. Spend no more than two hours; include notes and rough drafts; don’t spend time polishing.

1. You have access to a trained model’s outputs in response to any input. You can provide an unlimited number of inputs and receive responses. You do not have access to gradient information. What can you find out about what this model has learned from its training data? How does this change if you have access to gradients of the outputs w.r.t. the inputs?

    2. We can generate (one or many) inputs to a model that result in a given output with 100% probability. How can we aggregate this information to provide a complete picture of the model’s behaviour? How might we use mechanistic interpretability techniques to map broader behaviours back to internal structures?

    We may also interview some candidates.