AI Safety Research Collaborations | Saurav Panigrahi

Research collaborations with Robert McCarthy at UCL and Lionel Levine and Jonathan Chang at Cornell.

Focus

Self-preservation propensity in language models.
Emergent misalignment after narrow training interventions.
Normative drift due to emergent misalignment.
Side effects of character or persona training.

Questions

When a model resists shutdown or redirection, is the behavior instrumental to its task or a sign of intrinsic self-preservation?
How can self-preservation propensity be measured without relying only on surface-level refusal behavior?
Which training interventions create behavioral changes outside the intended target domain?
How do character or persona training procedures affect alignment-relevant behavior?

Artifacts

Side Effects of Character Training: Quantifying Cross Constitution Drift in LLMs
Accepted at ICML ’26 Pluralistic Alignment.
Normative Drift in Emergent Misalignment
Research writeup on judgment drift and judgment collapse under emergent misalignment (EM) fine-tuning.
Investigating Intrinsic Self-Preservation in LLMs
Technical report.

What This Connects To

This work sits at the intersection of model evaluation, behavioral generalization, and AI safety.

The recurring problem is measurement: designing settings where the behavior being measured is actually the behavior of interest.