89% Accuracy: PaliGemma’s Water Segmentation in Satellite Images

In May 2024, Google introduced PaliGemma, a vision-language model (VLM) that combines a SigLIP-So400m vision encoder with a Gemma language model. The model processes images at resolutions of 224, 448, or 896 pixels and can perform tasks such as visual question answering, object detection, and image segmentation. PaliGemma was trained on datasets including WebLI, OpenImages, and WIT, enabling it to identify many objects without fine-tuning, though Google recommends fine-tuning for domain-specific applications. For segmentation, PaliGemma emits 16 extra segmentation tokens that encode a mask within the predicted bounding box.

The model’s capabilities were tested on segmenting water in satellite images using the Kaggle Satellite Image of Water Bodies dataset, which contains 2,841 images; after preprocessing, 164 images were selected for fine-tuning. Preparing the data involved converting each mask into a text label using a pre-trained variational auto-encoder (VAE). Despite the challenges of data preparation, the default PaliGemma model reached 89% accuracy at segmenting water, outperforming a fine-tuned version.
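To make the token format concrete, here is a minimal sketch of parsing a PaliGemma segmentation answer. It assumes the documented output shape: four `<locXXXX>` tokens giving a bounding box on a 0–1023 grid, followed by 16 `<segXXX>` tokens indexing the VAE mask codebook. The function name and the exact coordinate rescaling (dividing by 1024) are illustrative assumptions, not the article's code.

```python
import re

def parse_paligemma_segmentation(output: str, width: int, height: int):
    """Parse a PaliGemma segmentation answer such as
    '<loc0256><loc0128><loc0768><loc0896><seg000>...<seg015> water'.

    Returns the bounding box in pixel coordinates (x0, y0, x1, y1)
    and the 16 segmentation-token indices for the VAE mask decoder.
    """
    # Location tokens are 4-digit bins on a 0..1023 grid, ordered y0 x0 y1 x1.
    locs = [int(m) for m in re.findall(r"<loc(\d{4})>", output)]
    # Segmentation tokens are 3-digit codebook indices (0..127), 16 per mask.
    segs = [int(m) for m in re.findall(r"<seg(\d{3})>", output)]
    if len(locs) != 4 or len(segs) != 16:
        raise ValueError("expected 4 loc tokens and 16 seg tokens")
    y0, x0, y1, x1 = locs
    # Rescale the normalised grid to pixel coordinates (assumed /1024 scaling).
    box = (round(x0 / 1024 * width), round(y0 / 1024 * height),
           round(x1 / 1024 * width), round(y1 / 1024 * height))
    return box, segs
```

The 16 codebook indices are then fed to the pre-trained VAE decoder, which reconstructs the mask inside the returned box.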

Source: towardsdatascience.com
