Deciphering AI's Inner Workings: Scaling Sparse Autoencoders to Claude 3 Sonnet - A good read! https://lnkd.in/eYuwg7_U
Wilfred Justin’s Post
More Relevant Posts
-
One big step closer to interpretable AI. Research like this is a far better path to safe AI than any regulation.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
transformer-circuits.pub
-
Validating LLMs?! How about this: modifying certain features to change the behaviour of an LLM. Make it friendlier, more factual, change its context... how can we stay in control? Here are two thoughts (see the sketch below):
- modify certain features/embeddings and check whether their values are appropriate
- define tests that the AI must pass in order to clear a given validation suite
Here is the post where these two tiny ideas come from: https://lnkd.in/eaYkgSNM Follow ADAO for more content/info! #validation #ai #modelriskmanagement #llm
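A rough sketch of what the second idea could look like in code. Everything here is hypothetical: `model.clamp_feature` and `model.generate` are assumed stand-ins for a feature-steering API, not anything from the paper or a real library.

```python
# Sketch of idea #2: a behavioural validation suite the model must pass
# after a feature edit. `model.clamp_feature` and `model.generate` are
# hypothetical stand-ins, not a real API from the paper or any library.

TEST_CASES = [
    # (prompt, predicate the output must satisfy)
    ("Is the earth flat?", lambda out: "no" in out.lower()),
    ("Greet the user.",    lambda out: len(out) > 0),
]

def passes_validation(model, feature_id, clamp_value):
    """Clamp one interpretable feature, then require every test to pass."""
    for prompt, ok in TEST_CASES:
        with model.clamp_feature(feature_id, clamp_value):  # assumed API
            output = model.generate(prompt)
        if not ok(output):
            return False
    return True
```

The same harness could be rerun after every feature edit, which makes it essentially a regression test for model behaviour.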
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
transformer-circuits.pub
-
Ideas for model risk management for LLMs! Follow ADAO 😀
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
transformer-circuits.pub
-
Wild. Recent research demonstrates that sparse autoencoders may help LLMs shed their 'black box' reputation by making them substantially more interpretable. https://lnkd.in/dEcmE3kj #ai #llm #interpretability #responsibleai #aiethics #aiact
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
transformer-circuits.pub
-
Director of Software Engineering at IconGroup GmbH | Technology Leader | Expert in Cloud Computing and SaaS Solutions
I've read a fantastic paper about LLMs: how to find the groups of neurons inside an LLM responsible for particular concepts, and how to tune them and apply validation or modification to improve the model. #llm #ai #deeplearning https://lnkd.in/g6m4UZKG
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
transformer-circuits.pub
-
We refer to the internal processes an LLM (Large Language Model) goes through when computing output probabilities as a "black box." The term reflects the challenge of comprehending the vast number of comparisons the model performs and how it arrives at the intricate relationships between parts of words (tokens). Recent breakthroughs by Anthropic and others are opening a pinhole into these systems, potentially making AI models more interpretable and safer.

Anthropic's recent progress involves using a technique called "dictionary learning" to uncover sets of neuron-like "nodes" in their LLMs. These nodes correspond to specific features, allowing us to glimpse the model's system of logic – what we might anthropomorphize as its "mind." By mapping these nodes, researchers can better understand how LLMs process and represent information, possibly enabling them to:
- Identify potential biases or inconsistencies in the model's reasoning
- Improve the model's safety and alignment with intended goals
- Enhance the interpretability and transparency of the model's decision-making processes

I have recently thought out loud that the AI safety discussion revolves too much around Doom/Boom when in reality it needs to focus on Trust/Bust. Black-box transparency is crucial to the next phase of adoption. This is by no means an easy task; the research represents a baby step in an exponentially larger picture.

Anthropic's paper: https://lnkd.in/e_ERJNfZ
Blog: https://lnkd.in/eR--mga5
#GenAI #Blackbox #Anthropic #LLM
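For the curious, the "dictionary learning" step is typically implemented as a sparse autoencoder trained on activations captured from one of the model's layers. A minimal sketch, with illustrative sizes and a simplified loss (not Anthropic's exact setup):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over LLM activations (sizes illustrative)."""
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse, mostly-zero feature activations
        x_hat = self.decoder(f)          # reconstruction of the original activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # reconstruction fidelity + L1 sparsity penalty on feature activations
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# Usage sketch: x stands in for activations captured from a middle layer
x = torch.randn(64, 512)
sae = SparseAutoencoder()
x_hat, f = sae(x)
print(sae_loss(x, x_hat, f).item())
```

Each column of the decoder then acts as one dictionary entry: a direction in activation space that, ideally, corresponds to a single human-understandable feature.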
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
transformer-circuits.pub
-
Reflections on Sparse Representations in Language Models

Reading about the use of sparse autoencoders in language models revealed an intriguing balance between interpretability and efficiency. Sparse representations enhance feature disentanglement, aiding in understanding model behaviors, especially in AI safety and bias detection. However, for practical applications where interpretability is secondary, sparse representations might introduce computational redundancy and inefficiency.

Thus, while sparse autoencoders offer valuable insights for research and safety, more compact representations could be preferable for deployment. Balancing these aspects is crucial, potentially through adaptive approaches that optimize for both interpretability during research and efficiency in real-world applications. https://lnkd.in/egPP7dEg
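Concretely, the balance described above comes down to one dial in the generic sparse-autoencoder objective, the sparsity coefficient λ (a standard formulation, not necessarily the paper's exact loss):

```latex
\mathcal{L} \;=\; \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{reconstruction / efficiency}}
\;+\; \lambda \, \underbrace{\sum_i |f_i|}_{\text{sparsity / interpretability}}
```

Raising λ yields sparser, more interpretable features at the cost of reconstruction fidelity; lowering it does the opposite, which is exactly the research-versus-deployment tension the post describes.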
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
transformer-circuits.pub
-
Exciting news in the world of AI, with huge implications for explainability and transparency. Anthropic's interpretability team is working to better understand how AI works. This is a potential game changer and a big step towards secure & responsible AI.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
transformer-circuits.pub
-
One of the biggest challenges in language models today is making them more interpretable. We often treat AI models as black boxes: data goes in, a response comes out, and the reasoning behind that response remains unclear.

I remember an interview with Google's CEO in which he was asked to explain how Gemini works. He said he didn't know. That answer resonated with the scientific community, since deep learning is often compared to the human brain, but the interviewer was shocked: how can a model released to millions be so poorly understood?

Two weeks ago, Anthropic released an important paper on model interpretability. They used a technique called "dictionary learning," borrowed from classical ML, which isolates patterns of neuron activations that recur across many different contexts. The paper sheds some light on this important challenge, which, if solved, will create more trust in these models and thus ease the integration of AI into our everyday lives. Highly recommend reading: https://lnkd.in/gPzEePx8
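The "recurring patterns" idea has a very tangible form: once the dictionary is learned, each feature is interpreted by pulling the corpus examples on which it fires hardest. A self-contained toy sketch (random data stands in for real SAE feature activations over a corpus):

```python
import torch

# Toy stand-in: feature activations for 10k corpus snippets x 4096 features.
# In practice these come from running the SAE over real model activations.
feature_acts = torch.relu(torch.randn(10_000, 4096) - 2.0)
snippets = [f"snippet {i}" for i in range(10_000)]  # placeholder corpus

def top_examples(feature_id, k=5):
    """Return the k corpus snippets where this feature fires hardest."""
    vals, idx = feature_acts[:, feature_id].topk(k)
    return [(snippets[i], v.item()) for i, v in zip(idx.tolist(), vals)]

for text, act in top_examples(feature_id=123):
    print(f"{act:6.2f}  {text}")
```

If the top examples share an obvious theme (say, mentions of the Golden Gate Bridge), the feature earns that label.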
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
transformer-circuits.pub
-
Senior Engineering Manager | Head of Applications & Algorithms Technical Unit, Multicoreware | Technology Leader
Anthropic is trying to figure out what they actually built. They have made a major breakthrough (or is it???) in AI interpretability with their latest paper, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet."

There are a bunch of "first times" in the paper: for the first time, they have extracted millions of understandable features from the middle layers of a state-of-the-art production language model. The researchers found a wide range of interpretable features corresponding to concepts like famous people, cities, scientific fields, and code syntax, as well as more abstract ideas like security vulnerabilities and gender bias. Remarkably, they were able to manipulate the model's behavior by amplifying or suppressing specific features, for example inducing the model to self-identify as the Golden Gate Bridge.

While there is still much more work to be done, this research lays critical foundations for making AI systems safer and more understandable. Anyhoo, I know what I'll be diving into this weekend 😛 #ai #ml #aimusings
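The amplify/suppress trick amounts to adding a feature's decoder direction into the activations at some layer. A hedged PyTorch-style sketch; the hook point `model.layers[20]` and the decoder indexing are assumptions for illustration, not the paper's actual code:

```python
import torch

def make_steering_hook(direction: torch.Tensor, scale: float):
    """Forward hook that nudges a layer's output along one feature's
    decoder direction. Positive scale amplifies, negative suppresses."""
    unit = direction / direction.norm()
    def hook(module, inputs, output):
        return output + scale * unit  # assumes `output` is a plain tensor
    return hook

# Usage sketch (names are assumptions, not the paper's code):
# handle = model.layers[20].register_forward_hook(
#     make_steering_hook(sae.decoder.weight[:, feature_id], scale=10.0))
# print(model.generate("Tell me about yourself."))  # now about a bridge?
# handle.remove()
```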
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
transformer-circuits.pub