Gregory Mermoud’s Post


𝗔𝗜 + 𝗦𝘆𝘀𝘁𝗲𝗺𝘀 · Distinguished Engineer @ Cisco · Innovation & Engineering Leader · 190+ granted patents 💡

Very insightful work by Anthropic’s interpretability team. And an amazing paper, with outstanding writing and figures. The idea is very simple: interpret LLMs by leveraging sparse autoencoders as surrogate models of the MLP of transformer blocks, which allows one to disambiguate the superposition of features captured by a single neuron. A simple idea, but a very careful and complex execution, as is often the case in our line of work. The paper goes into many details and provides a large array of insights, although the gist of the implementation remains obfuscated due to the closed-source nature of Claude. Too bad, because this is the kind of work we need to better understand and eventually trust LLMs. This is demonstrated by the authors in the section ‘Influence on Behavior’, where they show that clamping some features to high or low values during inference is “remarkably effective at modifying model outputs in specific, interpretable ways”. Hopefully this kind of work will be replicated and generalized to open-weights models, so that we have new ways to steer their behavior. https://lnkd.in/eVym7f_f #interpretability #xai #explainableai #steerableai #anthropic #claude
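For readers curious what the sparse-autoencoder surrogate and the feature-clamping intervention look like in practice, here is a minimal PyTorch sketch. This is not Anthropic’s implementation; the dimensions, the L1 coefficient, the feature index, and the clamping value are illustrative assumptions.

```python
# Minimal sketch: a sparse autoencoder (SAE) over MLP activations, plus the
# "feature clamping" intervention described in the post. NOT Anthropic's code;
# sizes, hyperparameters and the clamping value are illustrative assumptions.

import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Encoder maps MLP activations into an overcomplete feature space.
        self.encoder = nn.Linear(d_model, d_features)
        # Decoder reconstructs the original activations from the features.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature activations
        reconstruction = self.decoder(features)           # surrogate for the MLP output
        return reconstruction, features


def sae_loss(reconstruction, activations, features, l1_coeff: float = 1e-3):
    # Reconstruction error keeps the surrogate faithful to the MLP;
    # the L1 penalty pushes most features to zero (sparsity -> monosemanticity).
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().sum(dim=-1).mean()
    return mse + sparsity


if __name__ == "__main__":
    d_model, d_features = 512, 8192        # assumed sizes; features >> model dim
    sae = SparseAutoencoder(d_model, d_features)
    acts = torch.randn(4, d_model)         # stand-in for real MLP activations

    recon, feats = sae(acts)
    loss = sae_loss(recon, acts, feats)
    loss.backward()                        # a full training loop would follow

    # Clamping: force a hypothetical feature (index 123) to a high value and
    # decode; at inference, the decoded vector would replace the MLP output
    # to steer the model toward the behavior that feature encodes.
    with torch.no_grad():
        clamped = feats.clone()
        clamped[:, 123] = 10.0             # arbitrary high value for illustration
        steered_activations = sae.decoder(clamped)
```

The same mechanics would apply to an open-weights model: train the SAE on recorded MLP activations, then patch the decoded, clamped reconstruction back into the forward pass at the corresponding layer.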

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

transformer-circuits.pub

