PhD Studentship in Monitoring and Increasing LLM Safety

Closing date

30 July, 2026

Department of Engineering

LLMs are becoming more capable and society increasingly relies on them. This makes it important to ensure LLMs are safe. In this PhD you can use a variety of approaches, such as white-box mechanistic interpretability and black-box behavioural research to evaluate the safety of LLMs, monitor their behaviour at inference time, as well as devise strategies for reducing risk from LLMs. Initially, this PhD will focus on increasing CoT faithfulness and mitigating encoded reasoning.

This PhD is funded by Coefficient Giving, which has the following focus areas https://coefficientgiving.org/tais-rfp-research-areas/#6-encoded-reasoning-in-cot-and-inter-model-communication

The first 1.5 years of this PhD are scoped out and will be about investigating and carrying out either project 1 or project 2 (described below). After these projects have been completed to the highest standard, you will together with your supervisor and Coefficient Giving decide how to proceed, and what to investigate next.

Project 1: Test for straightforward meaning of CoT and mitigate deceptive behaviour via "perturbation methods".

First apply a CoT perturbation method (e.g. applying paraphrasing to intermediate outputs). You then compare the final outputs after the CoT is perturbed with baseline final outputs. Performance deterioration after applying perturbation methods, indicates the model was using words in the CoT in a non-straightforward way. If you find performance deterioration after applying perturbation methods, the next step is investigating (for example using mechanistic interpretability) the underlying cause e.g. the model using a secret code or prompt hacking itself.

Project 2: Train for transparency using a human predictor

Use a human (or AI imitating human behavior, e.g. an LLM) to evaluate whether the final model outputs (and counterfactual outputs) can be predicted based on the CoT. The accuracy of this human predictor is a measure of reasoning transparency and can be used as reward during training.

Qualifications required (edit as necessary): Applicants should have (or expect to obtain by the start date) at least a first degree in an Engineering or related subject.

Ideally applicants have some experience with either software development projects or research on LLMs.

This is a fully-funded studentship (fees and maintenance) to cover a home or overseas candidate....

https://www.cam.ac.uk/jobs/phd-studentship-in-monitoring-and-increasing-llm-safety-nm49585-0

Warning message