Research collaborations with Robert McCarthy at UCL, and with Lionel Levine and Jonathan Chang at Cornell.

Focus

  • Self-preservation propensity in language models.
  • Emergent misalignment after narrow training interventions.
  • Normative drift due to emergent misalignment.
  • Side effects of character or persona training.

Questions

  • When a model resists shutdown or redirection, is the behavior instrumental to some other goal, or is it self-preservation-like?
  • How can self-preservation propensity be measured without relying only on surface-level refusal behavior?
  • Which training interventions create behavioral changes outside the intended target domain?
  • How do character or persona training procedures affect alignment-relevant behavior?

Artifacts

What This Connects To

This work sits at the intersection of model evaluation, behavioral generalization, and AI safety.

The recurring problem is measurement: designing settings where the behavior being measured is actually the behavior of interest.