In December 2024, Meta introduced the Apollo family of Large Multimodal Models (LMMs), which can process several input types beyond text. The accompanying paper highlights four major design choices. The first concerns embeddings: like every Transformer model, Apollo converts user inputs such as text or video into tokens and then into embeddings, but for multimodal inputs a separate encoder is used for each modality. Feeding these modality-specific embeddings into one shared sequence gives the model a more nuanced view of diverse data forms, letting it interpret and respond to complex inputs. The Apollo models represent a significant step forward in AI's ability to handle multimedia content.
Source: towardsdatascience.com
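The per-modality encoding described above can be sketched as follows. This is a toy illustration, not Apollo's actual architecture: the embedding table, the frame-feature projection, and all dimensions are made-up assumptions, chosen only to show how text tokens and video frames can land in one shared embedding space that a Transformer then consumes as a single sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 8  # shared embedding width (toy size; real models use thousands)

# Hypothetical text path: token ids index rows of an embedding table.
VOCAB_SIZE = 100
text_embed_table = rng.normal(size=(VOCAB_SIZE, D_MODEL))

def embed_text(token_ids):
    """Look up one D_MODEL-dim embedding per token id."""
    return text_embed_table[np.asarray(token_ids)]

# Hypothetical video path: raw per-frame features from a vision encoder
# pass through a modality-specific projection into the same D_MODEL space.
FRAME_FEAT = 16  # size of raw per-frame features (assumed)
video_proj = rng.normal(size=(FRAME_FEAT, D_MODEL))

def embed_video(frame_features):
    """Project per-frame features into the shared embedding space."""
    return np.asarray(frame_features) @ video_proj

# Build one interleaved sequence the Transformer can consume.
text_part = embed_text([3, 17, 42])                         # 3 text tokens
video_part = embed_video(rng.normal(size=(5, FRAME_FEAT)))  # 5 video frames
sequence = np.concatenate([text_part, video_part], axis=0)
print(sequence.shape)  # (8, 8): 8 positions, each a D_MODEL-dim vector
```

The key point the sketch captures is that each modality gets its own encoder, but all encoders emit vectors of the same width, so the downstream Transformer never needs to know which positions came from text and which from video.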