We're excited to share this progress update from one of Martian's AI safety grant teams. Their research introduces a novel approach to scaling mechanistic interpretability techniques by transferring insights between different language models.

Key contributions:

  • A new technique for transferring mechanistic interpretability insights across LLMs, reducing the need to analyze each model individually and potentially saving significant compute resources
  • A method for transferring safety behaviors between models using steering vectors, advancing our ability to mitigate undesirable behaviors in LLMs at scale (a rough sketch of the steering-vector idea follows below)
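
For readers unfamiliar with steering vectors, here is a minimal sketch of the general idea: adding a fixed direction to a model's residual-stream activations at inference time to shift its behavior. This is an illustration of the standard technique, not the grant team's actual method; the model name ("gpt2"), layer index, steering strength, and the random stand-in vector are all placeholder assumptions.

```python
# Minimal sketch of activation steering with a fixed vector.
# Model, layer, strength, and the vector itself are illustrative
# placeholders, not the grant team's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model
LAYER = 6            # placeholder layer to steer
ALPHA = 4.0          # placeholder steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# In practice a steering vector is often the difference of mean
# activations between contrastive prompt sets; here we use a
# random unit vector as a stand-in.
hidden_size = model.config.hidden_size
steering_vector = torch.randn(hidden_size)
steering_vector /= steering_vector.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the
    # hidden states; add the steering direction to every position.
    hidden = output[0] + ALPHA * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
try:
    ids = tokenizer("The model's response is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook after generation
```

The hook applies the same direction on every forward pass during generation; transferring such a vector between models, as in the work above, additionally requires mapping it between the two models' activation spaces.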

Read the complete progress report on LessWrong.

Interested in working on projects like this or exploring the frontiers of machine intelligence? We're hiring!