
50% Faster: Hybrid Attention in Llama-3.2-1B

Researchers have developed a hybrid attention mechanism for the Llama-3.2-1B model, combining softmax sliding-window attention with linear attention. This modification replaces 50% of the model's self-attention layers, reducing the attention time complexity from O(n²) to O(n·w), where w is the window size. The hybrid model was trained on 1M tokens from the fineweb-edu dataset and fine-tuned on the Dolly-15K dataset. Performance tests show that as sequence length increases, the hybrid model becomes significantly faster than the original, with clear throughput gains. Accuracy tests on the MMLU benchmark indicate a slight edge over the original model, with a 2% increase in accuracy. However, performance can vary greatly depending on the GPU used, with accuracy differences of up to 6.75% observed across hardware. This highlights the impact of hardware on model performance and the potential of hybrid attention mechanisms to enhance inference speed while maintaining accuracy.
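To make the two attention variants concrete, here is a minimal sketch, not the article's actual implementation: a causal sliding-window softmax attention (cost proportional to n·w per layer) and a kernel-based linear attention, with a hypothetical schedule that swaps in the alternatives on half of the layers. The window size, the elu(x)+1 feature map, and the every-other-layer schedule are illustrative assumptions, not details from the source.

```python
# Sketch of the two attention variants described above (assumptions noted inline).
import torch
import torch.nn.functional as F


def sliding_window_attention(q, k, v, window: int):
    """Causal softmax attention restricted to the last `window` keys.

    q, k, v: (batch, heads, seq_len, head_dim).
    Each query attends to at most `window` keys, giving O(n*w) work;
    this naive version still materializes the full score matrix, which
    a production kernel would avoid.
    """
    n = q.size(-2)
    scale = q.size(-1) ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale              # (b, h, n, n)
    i = torch.arange(n, device=q.device)
    causal = i[None, :] <= i[:, None]                       # j <= i
    in_window = (i[:, None] - i[None, :]) < window          # i - j < w
    scores = scores.masked_fill(~(causal & in_window), float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def linear_attention(q, k, v):
    """Linear attention with an elu(x)+1 feature map (an illustrative choice).

    Cost is O(n * d^2) instead of O(n^2 * d). Non-causal for brevity;
    a causal variant would accumulate the k-v products with prefix sums.
    """
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                            # (b, h, d, d)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)   # (b, h, n, 1)
    return (q @ kv) / (z + 1e-6)


def hybrid_layer(layer_idx: int, q, k, v, window: int = 512):
    """Hypothetical 50% schedule: alternate the two mechanisms by layer index."""
    if layer_idx % 2 == 0:
        return sliding_window_attention(q, k, v, window)
    return linear_attention(q, k, v)


if __name__ == "__main__":
    b, h, n, d = 1, 4, 128, 64
    q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
    print(hybrid_layer(0, q, k, v).shape)  # torch.Size([1, 4, 128, 64])
    print(hybrid_layer(1, q, k, v).shape)  # torch.Size([1, 4, 128, 64])
```

In a sketch like this, the softmax layers keep local precision within the window while the linear layers carry long-range context at linear cost, which is the trade-off the hybrid design exploits.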

Source: towardsdatascience.com
