<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Evaluation on Saurav Panigrahi</title><link>https://sauravpanigrahi.com/tags/evaluation/</link><description>Recent content in Evaluation on Saurav Panigrahi</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 01 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://sauravpanigrahi.com/tags/evaluation/feed.xml" rel="self" type="application/rss+xml"/><item><title>AI Safety Research Collaborations</title><link>https://sauravpanigrahi.com/work/ai-safety-research-collaborations/</link><pubDate>Fri, 01 May 2026 00:00:00 +0000</pubDate><guid>https://sauravpanigrahi.com/work/ai-safety-research-collaborations/</guid><description>&lt;p&gt;Research collaborations with &lt;a href="https://scholar.google.com/citations?user=p1NIunwAAAAJ&amp;amp;hl=en"&gt;Robert McCarthy&lt;/a&gt; at UCL and &lt;a href="https://lionellevine.github.io/"&gt;Lionel Levine&lt;/a&gt; and Jonathan Chang at Cornell.&lt;/p&gt;
&lt;h2 id="focus"&gt;Focus&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Self-preservation propensity in language models.&lt;/li&gt;
&lt;li&gt;Emergent misalignment after narrow training interventions.&lt;/li&gt;
&lt;li&gt;Normative drift due to emergent misalignment.&lt;/li&gt;
&lt;li&gt;Side effects of character or persona training.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="questions"&gt;Questions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;When a model resists shutdown or redirection, is the behavior merely instrumental to another goal, or does it reflect something like intrinsic self-preservation?&lt;/li&gt;
&lt;li&gt;How can self-preservation propensity be measured without relying only on surface-level refusal behavior?&lt;/li&gt;
&lt;li&gt;Which training interventions create behavioral changes outside the intended target domain?&lt;/li&gt;
&lt;li&gt;How do character or persona training procedures affect alignment-relevant behavior?&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="artifacts"&gt;Artifacts&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drive.google.com/file/d/1bm9W37CekUo4N1-RHFGvHFaElrJDPLD2/view?usp=sharing"&gt;Technical Report: Side Effects of Character Training: Quantifying Cross Constitution Drift in LLMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drive.google.com/file/d/1wnWA0684P8JQwoXLxIiQrxr71bH6M3-d/view?usp=drive_link"&gt;Technical Report: Investigating Intrinsic Self-Preservation in LLMs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="what-this-connects-to"&gt;What This Connects To&lt;/h2&gt;
&lt;p&gt;This work sits at the intersection of model evaluation, behavioral generalization, and AI safety.&lt;/p&gt;</description></item><item><title>Medmarks</title><link>https://sauravpanigrahi.com/work/medmarks/</link><pubDate>Fri, 01 May 2026 00:00:00 +0000</pubDate><guid>https://sauravpanigrahi.com/work/medmarks/</guid><description>&lt;p&gt;Medmarks is an open-source benchmark suite for evaluating medical capabilities in language models across a mix of verifiable and open-ended clinical tasks.&lt;/p&gt;
&lt;h2 id="focus"&gt;Focus&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Medical LLM evaluation.&lt;/li&gt;
&lt;li&gt;Verifiable and open-ended benchmark tasks.&lt;/li&gt;
&lt;li&gt;LLM-as-judge evaluation for open-ended (non-verifiable) tasks.&lt;/li&gt;
&lt;li&gt;Tracking of clinically relevant model capabilities.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="artifacts"&gt;Artifacts&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sophont.med/blog/medmarks"&gt;Medmarks v0.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2605.01417v1"&gt;Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Plausible vs Faithful</title><link>https://sauravpanigrahi.com/notes/plausible-vs-faithful/</link><pubDate>Fri, 01 May 2026 00:00:00 +0000</pubDate><guid>https://sauravpanigrahi.com/notes/plausible-vs-faithful/</guid><description>&lt;p&gt;Plausible reasoning sounds right.&lt;/p&gt;
&lt;p&gt;Faithful reasoning preserves the structure of the thing being reasoned about.&lt;/p&gt;
&lt;p&gt;That distinction matters because many failures do not look like nonsense. They look coherent. They explain themselves well. They use the right vocabulary. They produce an answer that could have been true.&lt;/p&gt;
&lt;p&gt;The problem is that &amp;ldquo;could have been true&amp;rdquo; is a weak standard.&lt;/p&gt;
&lt;p&gt;In writing, plausibility shows up as an argument that flows but hides a missing step. In research, it shows up as a result that has a clean story but rests on a proxy. In AI systems, it shows up as an answer that sounds grounded while drifting away from the actual process that produced it.&lt;/p&gt;</description></item></channel></rss>