Evaluation on Saurav Panigrahi

AI Safety Research Collaborations

Fri, 01 May 2026 00:00:00 +0000

Research collaborations with Robert McCarthy at UCL, and with Lionel Levine and Jonathan Chang at Cornell.

Focus

Self-preservation propensity in language models.
Emergent misalignment after narrow training interventions.
Normative drift due to emergent misalignment.
Side effects of character or persona training.

Questions

When a model resists shutdown or redirection, is the behavior instrumental to finishing its task, or does it reflect a self-preservation-like propensity?
How can self-preservation propensity be measured without relying only on surface-level refusal behavior?
Which training interventions create behavioral changes outside the intended target domain?
How do character or persona training procedures affect alignment-relevant behavior?

What This Connects To

This work sits at the intersection of model evaluation, behavioral generalization, and AI safety.

Medmarks

Fri, 01 May 2026 00:00:00 +0000

Medmarks is an open-source benchmark suite for evaluating medical capabilities in language models across a mix of verifiable and open-ended clinical tasks.

Focus

Medical LLM evaluation.
Verifiable and open-ended benchmark tasks.
LLM-as-judge evaluation for non-verifiable tasks.
Clinically relevant model capability tracking.
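
The split between verifiable and non-verifiable tasks can be illustrated with a small scoring router. This is a minimal sketch, not Medmarks' actual code or API: the `Task` class, the `score` function, and `toy_judge` are hypothetical names introduced here, and the judge is a stand-in for a real LLM call. It assumes verifiable tasks carry a reference answer that can be checked by normalized exact match, while open-ended tasks are passed to a judge that returns a score in [0, 1].

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    """Hypothetical task record; not the Medmarks schema."""
    prompt: str
    reference: Optional[str] = None  # set only for verifiable tasks


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def score(task: Task, answer: str, judge: Callable[[str, str], float]) -> float:
    """Score one answer: exact match if a reference exists, otherwise defer to the judge."""
    if task.reference is not None:
        return 1.0 if normalize(answer) == normalize(task.reference) else 0.0
    # Open-ended task: clamp the judge's score into [0, 1].
    return min(1.0, max(0.0, judge(task.prompt, answer)))


def toy_judge(prompt: str, answer: str) -> float:
    """Stand-in for an LLM judge (assumption): rewards any non-empty answer."""
    return 1.0 if answer.strip() else 0.0
```

In a real suite the judge would wrap a model call with a rubric, and the verifiable branch would likely need task-specific matching (multiple choice letters, numeric tolerance) rather than plain string equality.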

Plausible vs Faithful

Fri, 01 May 2026 00:00:00 +0000

Plausible reasoning sounds right.

Faithful reasoning preserves the structure of the thing being reasoned about.

That distinction matters because many failures do not look like nonsense. They look coherent. They explain themselves well. They use the right vocabulary. They produce an answer that could have been true.

The problem is that “could have been true” is a weak standard.

In writing, plausibility shows up as an argument that flows but hides a missing step. In research, it shows up as a result that has a clean story but rests on a proxy. In AI systems, it shows up as an answer that sounds grounded while drifting away from the actual process that produced it.