Sparse Autoencoders can help interpret complex neural networks like Large Language Models (LLMs) by disentangling superposed features. A toy example with a neural network having one hidden layer of three dimensions was used to demonstrate this. The model was trained on tokens like “cat,” “happy cat,” “dog,” “energetic dog,” “not cat,” “not dog,” “robot,” and “AI assistant.” After training, the Sparse Autoencoder mapped these activations into a 20-dimensional space, revealing four interpretable features. Feature 1 activated for “cat,” “happy cat,” “dog,” and “energetic dog,” suggesting an animal or pet-related concept. Feature 2 activated for “robot” and “AI assistant,” indicating a technology-related feature. Feature 3 activated for “not cat,” “not dog,” “robot,” and “AI assistant,” possibly representing non-animal entities. This simple example shows how Sparse Autoencoders can extract meaningful, human-friendly features from neural networks, even when the original model is simplistic.
Source: medium.com















