Department of Engineering
Large language models (LLMs) are becoming more capable, and society increasingly relies on them, which makes it important to ensure that they are safe. In this PhD you can use a variety of approaches, such as white-box mechanistic interpretability and black-box behavioural research, to evaluate the safety of LLMs, monitor their behaviour at inference time, and devise strategies for reducing the risk they pose. Initially, this PhD will focus on increasing chain-of-thought (CoT) faithfulness and mitigating encoded reasoning.
This PhD is funded by Coefficient Giving, whose focus areas are described at https://coefficientgiving.org/tais-rfp-research-areas/#6-encoded-reasoning-in-cot-and-inter-model-communication
The first 1.5 years of this PhD are scoped out and will be spent investigating and carrying out either project 1 or project 2 (described below). After this work has been completed to the highest standard, you will decide, together with your supervisor and Coefficient Giving, how to proceed and what to investigate next.
https://www.cam.ac.uk/jobs/phd-studentship-in-monitoring-and-increasing-llm-safety-nm49585