Selected references worth standing behind.

This is the broad index: papers, essays, posts, repos, tools, benchmarks, docs, hubs, and other useful references.

AI Safety

Evaluation

  • Medmarks
    Benchmark suite. Medical capabilities in language models.

  • LAB-Bench
    Paper. Benchmark for language models doing biology research tasks.

  • FOMO26
    Challenge. Foundation model challenge for brain MRI.

  • Open Graph Benchmark
    Benchmark suite. Standardized graph ML datasets, loaders, and evaluators.

  • RoboTwin
    Paper. Dual-arm robot benchmark using generative digital twins for scalable task and data generation.

Tool Use And Agents

  • Harness Engineering
    Essay. Building products with agents through environments, specs, and reliability loops.

  • Code Mode
    Post. Tool use through code interfaces rather than repeated chat-level tool calls.

  • Context Mode
    Post. Pattern for keeping agent context usable when tools produce large outputs.

  • Agents Learn Their Runtime
    Paper. Persistent versus reset Python interpreters in CodeAct-style training.

  • AI Gave Birth to the 100x Engineer
    Essay. Case study on compounding agent workflows with test harnesses and supporting tools.

ML Systems

Research Engineering

Programmable Biology

  • Evo 2
    Paper and code. Long-context genomic foundation model for sequence modeling and design.

  • HyenaDNA
    Paper. Long-context sequence models at nucleotide resolution.

  • AlphaFold
    Paper. Foundational protein structure prediction.

  • AlphaFold 3
    Paper. Structure prediction for biomolecular complexes and interactions.

  • OpenFold
    Code. Open implementation and training stack for AlphaFold-style systems.

  • Rosalind
    Bioinformatics algorithms taught through concrete programming problems.