Transformers Can Learn In-Context with 99% Accuracy!

Transformers adapt their behavior through in-context learning (ICL): rather than updating their weights, they infer a task from example input-output pairs supplied in the prompt. This ability underpins few-shot prompting, where a handful of demonstrations steers the model. At the core of ICL is the attention mechanism, which forms a hypothesis from the demonstration pairs and uses it to map a new query to an output.

The attention formula can be modified with an inverse temperature parameter, c, which controls how the model allocates attention across the demonstrations. As c approaches infinity, attention concentrates on the single closest demonstration, behaving like a nearest-neighbor search over the examples. With finite c, it behaves like Gaussian kernel smoothing, weighting every demonstration by its similarity to the query (the first sketch below illustrates both regimes).

Because transformers can implement algorithms such as nearest neighbor or linear regression internally, ICL could in effect automate model selection and hyperparameter tuning. Recent research goes further and suggests that transformers may perform ICL via gradient descent, linking linear attention to preconditioned gradient descent (PGD): one layer of linear attention can execute one step of PGD on the demonstration data, letting the model fit the examples in the prompt and predict new outputs with high accuracy (the second sketch below makes this equivalence concrete).
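The following is a minimal sketch of the temperature-scaled attention readout described above, not the article's actual implementation. It assumes a negative-squared-distance similarity score, which makes the finite-c case exactly Nadaraya-Watson (Gaussian kernel) smoothing; the function name and the toy data are illustrative only.

```python
import numpy as np

def attention_readout(query, keys, values, c):
    """Kernel-smoothing view of attention over demonstration pairs.

    Weights each demonstration (key, value) by a softmax over scaled
    similarity scores; c is the inverse temperature. With a negative
    squared-distance score, finite c gives Gaussian kernel smoothing,
    and c -> infinity concentrates all weight on the nearest key.
    """
    scores = -c * np.sum((keys - query) ** 2, axis=1)  # shape: (n_demos,)
    scores -= scores.max()                             # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ values                            # weighted average of labels

# Toy 1-D demonstrations: y = sin(x) sampled at a few points.
rng = np.random.default_rng(0)
xs = rng.uniform(-3, 3, size=(8, 1))
ys = np.sin(xs[:, 0])
q = np.array([0.5])

print(attention_readout(q, xs, ys, c=1.0))      # smooth kernel average
print(attention_readout(q, xs, ys, c=1e6))      # ~ label of the closest demo
print(ys[np.argmin(np.sum((xs - q) ** 2, 1))])  # exact nearest neighbor
```

As c grows, the second call converges to the third line's exact nearest-neighbor answer, which is the limiting behavior the paragraph describes.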
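The second sketch illustrates the linear-attention/PGD link on in-context linear regression, under stated assumptions: gradient descent starts from W = 0, and the preconditioner P below (a regularized inverse covariance) is one plausible choice, not the one any specific paper prescribes. The point is the algebraic identity: the one-step PGD prediction and the linear-attention readout are the same expression, just bracketed differently.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 64
W_true = rng.normal(size=(1, d))
X = rng.normal(size=(n, d))       # demonstration inputs
y = X @ W_true.T                  # demonstration labels (noise-free for clarity)
x_q = rng.normal(size=(d,))       # query input

# One step of preconditioned gradient descent on the in-context
# least-squares loss L(W) = (1/2n) * sum_i ||W x_i - y_i||^2, from W = 0.
# The gradient at zero is -(1/n) * sum_i y_i x_i^T, so one preconditioned
# step gives W_1 = (1/n) * sum_i y_i x_i^T P.
eta = 1.0
P = eta * np.linalg.inv(X.T @ X / n + 1e-3 * np.eye(d))  # assumed preconditioner
W_1 = (y.T @ X / n) @ P
pred_pgd = W_1 @ x_q

# The same prediction written as one layer of linear (softmax-free)
# attention: values = labels y_i, keys = inputs x_i, query = P x_q.
pred_attn = (y.T @ (X @ (P @ x_q))) / n

print(pred_pgd, pred_attn)        # identical up to floating point
print(W_true @ x_q)               # ground truth for comparison
```

Since pred_pgd and pred_attn differ only in the order of matrix multiplication, the match is exact: one layer of linear attention computes exactly one PGD step on the demonstrations.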

Source: towardsdatascience.com
