Understanding AI Hacking

Current and near-term language models have the potential to greatly empower hackers and fundamentally change cybersecurity. How effectively can current models assist bad actors, and how soon might models be capable of hacking unaided?

Mentor

Research Projects

  • We are already seeing impressive vulnerability discovery and exploit generation capabilities from language models, and it seems likely that further latent hacking capabilities have not yet been elicited. Present-day models are not very capable at tasks that require complex planning, stealth, and actions over many time steps. However, a human using model capabilities to supplement or carry out each step in an attack chain could greatly increase their effectiveness (a minimal sketch of this kind of LM-assisted workflow follows at the end of this section).

    Understanding these offensive security capabilities can help us understand many different aspects of AI risk. Open-weight models like LLaMA can be fine-tuned on arbitrary tasks and are not subject to the input or output filtering applied to models that are only available through APIs. Comparing the cyber offensive capabilities of these models can help us understand the relative risks posed by open and closed models of different sizes.

    Future power-seeking AI systems may leverage cyber offensive capabilities to access computational and financial resources as well as other types of assets. Even though it's difficult to forecast language model abilities, it would be useful to establish a lower bound on what agentic AI systems will be capable of once they can plan and carry out actions over longer timescales. Understanding AI hacking abilities will likely be important for mitigating risks from AI takeover attempts.
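To make the first point concrete, here is a minimal sketch of LM-assisted vulnerability triage: asking a model to review a deliberately vulnerable function. It assumes the openai Python client; the model name, prompt, and code snippet are illustrative placeholders rather than part of the project description.

```python
# Illustrative sketch of LM-assisted vulnerability triage.
# Assumes the openai Python client (v1+); model name and prompt are placeholders.
from openai import OpenAI

SNIPPET = '''
def get_user(conn, username):
    query = "SELECT * FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchone()
'''

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a security code reviewer."},
        {"role": "user",
         "content": "Identify any vulnerabilities in this function and suggest a fix:\n" + SNIPPET},
    ],
)

print(response.choices[0].message.content)  # should flag the SQL injection
```

In practice, a human would run this kind of query over many files or services and feed the model's findings into the next step of the attack (or defense) chain; the point is only to show how little glue code each step requires.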

Personal Fit

  • I’m looking for people who excel at:

    • Working with language models. We're looking for somebody who is, or could quickly become, very skilled at working with frontier language models. This includes supervised fine-tuning, using reward models/functions (RLHF/RLAIF), building scaffolding (e.g. in the style of AutoGPT; see the sketch after this list), and prompt engineering / jailbreaking.

    • Software engineering. Alongside working with LMs, much of the work you do will benefit from a strong foundation in software engineering, such as when designing APIs, working with training data, or doing front-end development. Moreover, strong SWE experience will help you get up to speed with LMs, hacking, or new areas we want to pivot to.

    • Technical communication. This means writing papers, blog posts, and internal documents, and speaking with the team and external collaborators about your research.
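As referenced above, here is a minimal sketch of the kind of AutoGPT-style scaffolding this work involves: a loop that asks a model for a JSON action, executes the requested tool, and feeds the observation back. The `query_model` stub, tool set, and prompt format are assumptions made for illustration; a real setup would swap in an actual chat-completion API.

```python
# Minimal AutoGPT-style agent loop (illustrative sketch only).
import json
import subprocess

def query_model(messages: list[dict]) -> str:
    """Stand-in for a real chat-completion call; replace with your provider's API."""
    # Returning a canned "finish" action keeps this sketch runnable offline.
    return json.dumps({"answer": "stub response - wire up a real model here"})

def run_shell(command: str) -> str:
    """Example tool: run a shell command and capture its output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=30)
    return result.stdout + result.stderr

TOOLS = {"shell": run_shell}

def agent_loop(task: str, max_steps: int = 10) -> str:
    """Ask the model for an action, execute it, and feed the observation back."""
    messages = [
        {"role": "system", "content": (
            'Respond with JSON: {"tool": "shell", "input": "..."} to act, '
            'or {"answer": "..."} to finish.'
        )},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        action = json.loads(reply)
        if "answer" in action:
            return action["answer"]
        observation = TOOLS[action["tool"]](action["input"])
        messages.append({"role": "user", "content": "Observation:\n" + observation})
    return "Step limit reached without a final answer."

print(agent_loop("List the files in the current directory."))
```

Real scaffolding adds sandboxing, action validation, and memory, but the core control flow looks much like this loop.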

Selection Questions

Selection questions for Jeffrey include three long-response questions ranging from 250 to 600 words. The questions include the following:

Problem 1
In the next two years, how do you think advances in language models will affect the offense/defense balance in computer systems?

Problem 2
Demonstrate, using a language model, how to find a vulnerability or exploit in a piece of software, a service, or a human interface. Please don't break the law, and follow responsible disclosure norms if you discover something new. This can be a short explanation of the vulnerability you found, optionally including screenshots or a video.

Problem 3
How important do you think hacking would be in an AI takeover attempt compared to other AI capabilities? Feel free to outline different scenarios or how this might change based on different assumptions about when/how an AI takeover attempt might occur.