WildCLIP: Scene and animal attribute retrieval from camera trap data with domain-adapted vision-language models

  • Valentin Gabeff (Creator)
  • Marc Rußwurm (Creator)
  • Devis Tuia (Creator)
  • Alexander Mathis (Creator)

Dataset

Description

WildCLIP is a fine-tuned CLIP model that retrieves camera-trap events from the Snapshot Serengeti dataset using natural-language queries. The project aims to demonstrate how vision-language models can assist the annotation of camera-trap datasets.

Here we provide the processed Snapshot Serengeti data used to train and evaluate WildCLIP, along with two versions of WildCLIP (model weights).

Details on how to run these models can be found in the project GitHub repository.
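As a rough illustration of natural-language retrieval with such a model, the sketch below scores text queries against a single image crop using the open_clip library. The backbone name ("ViT-B-16"), the checkpoint filename, and the checkpoint format are assumptions made for illustration; the repository documents the actual loading procedure.

```python
# Hypothetical retrieval sketch with a CLIP-style model. The backbone
# variant and checkpoint filename below are assumptions, not confirmed
# details of the released WildCLIP weights.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16")
state = torch.load("wildclip_vitb16.pt", map_location=device)  # assumed filename
model.load_state_dict(state)  # assumes the file holds a plain state dict
model = model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-16")

image = preprocess(Image.open("crop.jpg")).unsqueeze(0).to(device)
queries = ["a lion standing in tall grass", "a zebra walking at night"]
text = tokenizer(queries).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    # normalize so the dot product is a cosine similarity
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(0)  # one score per query

for query, score in zip(queries, sims.tolist()):
    print(f"{score:.3f}  {query}")
```

Ranking many crops by their score for a single query follows the same pattern with the roles of images and text swapped.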

Provided data (images and attribute annotations): 

The data consist of 380 × 380 pixel image crops derived from MegaDetector detections on Snapshot Serengeti with a detection confidence above 0.7. Only camera-trap images containing a single individual were considered. A sketch of this filtering and cropping step follows below.
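For concreteness, the following is a plausible sketch of how such crops could be produced from MegaDetector batch output (a JSON file with normalized [x, y, w, h] boxes). The file names, the square-window cropping heuristic, and the lack of border handling are assumptions; the authors' exact preprocessing is documented in the repository.

```python
# Plausible sketch of crop generation from MegaDetector batch output.
# Paths and the centered-square heuristic are assumptions for illustration.
import json
from PIL import Image

CONF_THRESH = 0.7
CROP_SIZE = 380
ANIMAL_CATEGORY = "1"  # MegaDetector category id for "animal"

with open("megadetector_output.json") as f:
    results = json.load(f)

for entry in results["images"]:
    detections = [
        d for d in entry.get("detections", [])
        if d["category"] == ANIMAL_CATEGORY and d["conf"] > CONF_THRESH
    ]
    if len(detections) != 1:  # keep only single-individual images
        continue
    img = Image.open(entry["file"])
    W, H = img.size
    x, y, w, h = detections[0]["bbox"]  # normalized coordinates
    # center a square window on the detection, then resize to 380 x 380
    # (boxes spilling over the image border are not handled here)
    cx, cy = (x + w / 2) * W, (y + h / 2) * H
    side = max(w * W, h * H)
    box = (cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2)
    crop = img.crop(tuple(round(v) for v in box)).resize((CROP_SIZE, CROP_SIZE))
    crop.save(entry["file"].rsplit(".", 1)[0] + "_crop.jpg")
```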
Date made available: 6 May 2024
Publisher: Swiss Federal Institute of Technology Lausanne (EPFL)
