75% Success Rate: Fine-Tuning CLIP for YouTube Thumbnails

Multimodal embedding models like CLIP can be fine-tuned to improve performance in specialized domains. A recent study demonstrated this by adapting CLIP to match YouTube video titles with thumbnails. The process involved extracting title-thumbnail pairs from a YouTube channel, creating negative pairs for training, and organizing the data into triplets of a title, its matching thumbnail, and a non-matching thumbnail. The model was trained on just 53 examples, and only the final projection layer, which has about 1 million parameters, was updated. Training used the Multiple Negatives Ranking Loss, which maximizes the similarity of positive pairs while minimizing it for negative pairs. After fine-tuning, the model achieved a 75% Recall@1 score on the test set, meaning the top-ranked match for a thumbnail was its original title 75% of the time. This result suggests that fine-tuning can be effective even with limited data.
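To make the loss and the metric concrete, here is a minimal pure-Python sketch of how a Multiple Negatives Ranking style loss and Recall@1 can be computed from a title-by-thumbnail similarity matrix. It is not the study's code: the 2-D embeddings are made-up toy values, the function names are hypothetical, the scale factor of 20 is an illustrative choice, and treating titles as rows (queries) and thumbnails as columns is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(title_embs, thumb_embs):
    """sim[i][j] = similarity of title i and thumbnail j (assumed layout)."""
    return [[cosine(t, th) for th in thumb_embs] for t in title_embs]

def recall_at_1(sim):
    """Fraction of rows whose highest-scoring column is the matching one (the diagonal)."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(sim)

def mnr_loss(sim, scale=20.0):
    """Cross-entropy over each row, with the diagonal entry as the correct class.

    Every other column in the row acts as a negative, so lowering the loss
    raises positive-pair similarity relative to the negatives.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim)

# Toy embeddings (hypothetical): the fourth thumbnail sits closer to the
# third title than to its own, so one of four rows is mismatched.
titles = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
thumbs = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1], [-1.0, 0.2]]

sim = similarity_matrix(titles, thumbs)
print(recall_at_1(sim))  # 0.75
print(mnr_loss(sim))
```

In an actual fine-tuning run the embeddings would come from CLIP's text and image encoders, and the loss would be backpropagated through the trainable projection layer; the sketch only shows what the two quantities measure.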

Source: medium.com