Researchers at Anthropic have developed a technique called circuit tracing to observe the inner workings of their large language model Claude 3.5 Haiku. By analyzing 10 different behaviors, they found that Claude uses language-neutral components to process a prompt before settling on a specific language for its response: asked for the opposite of "small" in several languages, it first activates components related to smallness and opposites, then picks the language of the reply. For arithmetic, Claude employs unusual strategies, such as estimating an approximate sum in one pathway while computing the exact final digit in another, then combining the two (see the sketch below). The model also plans ahead when writing poetry, choosing candidate words for future lines before writing toward them. Anthropic additionally found that post-training makes declining to answer the model's default when it is unsure, but a component that recognizes well-known entities can override that refusal, producing confident false statements about names the model recognizes without knowing much about them. Despite these insights, a full understanding of how such mechanisms form during training remains elusive.
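The reported example is 36 + 59: one pathway reportedly lands on "roughly 92" while a separate pathway works out that the answer must end in 5, and combining the two yields 95. The Python sketch below is a toy analogy of that parallel strategy, not Anthropic's actual mechanism; the function names and the combination rule are invented for illustration.

```python
# Toy analogy of the parallel addition strategy described above,
# using the reported 36 + 59 example. Not Anthropic's actual circuitry.

def rough_estimate(a: int, b: int) -> int:
    """Pathway 1: approximate magnitude, e.g. 36 + 59 ~ 40 + 60 = 100."""
    return round(a / 10) * 10 + round(b / 10) * 10

def exact_last_digit(a: int, b: int) -> int:
    """Pathway 2: precise final digit, e.g. 6 + 9 = 15, so the sum ends in 5."""
    return (a % 10 + b % 10) % 10

def combine(rough: int, last_digit: int) -> int:
    """Refine: pick the value nearest the rough estimate with the right final digit.

    Ties break toward the smaller candidate; that arbitrary choice is one of
    several ways this toy diverges from whatever the model actually does.
    """
    base = (rough // 10) * 10
    candidates = (base - 10 + last_digit, base + last_digit, base + 10 + last_digit)
    return min(candidates, key=lambda c: abs(c - rough))

if __name__ == "__main__":
    a, b = 36, 59
    rough = rough_estimate(a, b)    # 100
    digit = exact_last_digit(a, b)  # 5
    print(combine(rough, digit))    # 95
```

The point of the analogy is that neither pathway alone produces the answer: the rough estimate lacks precision and the digit pathway lacks magnitude, but together they pin down 95.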
Source: www.technologyreview.com
