A Pipeline for Identification of Bird and Frog Species in Tropical Soundscape Recordings Using a Convolutional Neural Network

Jack LeBien; Ming Zhong; Marconi Campos-Cerqueira; Julian P. Velev; Rahul Dodhia; Juan M. Lavista Ferres; Mitchell Aide

Science Direct | September 2020

Automated acoustic recorders can collect long-term soundscape data containing species-specific signals in remote environments, and ecologists have increasingly used them to study diverse fauna around the globe. Deep learning methods have recently gained attention for automating species identification in soundscape recordings. We present an end-to-end pipeline for training a convolutional neural network (CNN) for multi-species, multi-label classification of soundscape recordings, starting from raw, unlabeled audio. Training data for species-specific signals are collected using a semi-automated procedure consisting of an efficient template-based signal detection algorithm and a graphical user interface for rapid validation of detections. A CNN is then trained on mel-spectrograms of the audio to predict the set of species present in a recording. Transfer learning from a pre-trained model is employed to reduce the training data and time required. Furthermore, we define a loss function that allows both true and false template-based detections to be used to train a multi-class, multi-label audio classifier. This approach leverages relevant absence (negative) information in training and, by allowing weak labels, reduces the effort of creating multi-label training data. We evaluated the pipeline using a set of soundscape recordings collected across 749 sites in Puerto Rico. A CNN model was trained to identify 24 regional species of birds and frogs. The semi-automated training data collection process greatly reduced the manual effort required for training. The model was evaluated on a held-out set of 1000 randomly sampled 1-min soundscapes from 17 sites in the El Yunque National Forest. The test recordings contained an average of ~3 target species per recording, and a maximum of 8. The test set also showed a large class imbalance, with most species present in fewer than 5% of recordings and others present in more than 25%.
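The abstract does not give the form of the loss function. As an illustration only, one common way to train a multi-label classifier from partially validated labels is a masked binary cross-entropy: each (recording, species) pair contributes to the loss only if a template detection for that species was validated as true (label 1) or false (label 0). The function name `masked_bce`, the mask encoding, and the toy arrays below are assumptions made for this sketch, not the authors' implementation.

```python
import numpy as np

def masked_bce(logits, labels, mask, eps=1e-7):
    """Binary cross-entropy averaged over validated (recording, species) pairs.

    logits: (n_recordings, n_species) raw model outputs.
    labels: same shape; 1 = validated true detection, 0 = validated false detection.
    mask:   same shape; 1 where a detection was validated, 0 where the label is unknown.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mask = np.asarray(mask, dtype=float)

    # Sigmoid turns each logit into an independent per-species probability.
    probs = 1.0 / (1.0 + np.exp(-logits))
    probs = np.clip(probs, eps, 1.0 - eps)

    per_label = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))

    n_validated = mask.sum()
    if n_validated == 0:
        return 0.0  # nothing validated in this batch, so no training signal
    # Unvalidated entries are multiplied by zero and do not affect the loss.
    return float((per_label * mask).sum() / n_validated)

# Toy batch (hypothetical): 2 recordings, 3 species.
logits = np.array([[2.0, -1.0, 0.5],
                   [0.0, 3.0, -2.0]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0]])
mask = np.array([[1, 1, 0],    # species 3 was never validated in recording 1
                 [0, 1, 1]])   # species 1 was never validated in recording 2
loss = masked_bce(logits, labels, mask)
```

Under this assumption, a recording needs labels only for the species whose detections were reviewed, which is what makes weak labels usable: unknown entries are skipped rather than treated as absences.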
The model achieved a mean average precision of 0.893 across the 24 species. Across all predictions, the total average precision was 0.975.
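The two reported figures can differ substantially when classes are imbalanced. The sketch below shows one plausible reading of them, stated here as an assumption: mean average precision is the unweighted mean of per-species average precision, while total average precision pools every (recording, species) prediction into a single ranking. The helper names and the toy data are hypothetical, and tied scores are ranked in input order rather than handled specially.

```python
import numpy as np

def average_precision(y_true, scores):
    """Average precision of one ranking: mean of precision@k over the positive items."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # highest score first
    hits = y_true[order]
    n_pos = hits.sum()
    if n_pos == 0:
        return float("nan")                      # undefined when no positives exist
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / n_pos)

def mean_and_total_ap(Y, S):
    """Y, S: (n_recordings, n_species) ground-truth presence and predicted scores."""
    Y = np.asarray(Y)
    S = np.asarray(S)
    # Per-species AP, averaged with equal weight for rare and common species.
    per_species = [average_precision(Y[:, j], S[:, j]) for j in range(Y.shape[1])]
    mean_ap = float(np.nanmean(per_species))
    # Pooled AP: all predictions ranked together, so common species dominate.
    total_ap = average_precision(Y.ravel(), S.ravel())
    return mean_ap, total_ap

# Toy data (hypothetical): 4 recordings, 2 species.
Y = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
S = np.array([[0.9, 0.2], [0.8, 0.7], [0.3, 0.1], [0.1, 0.6]])
mean_ap, total_ap = mean_and_total_ap(Y, S)
```

Because the per-species mean weights a species present in a handful of recordings as heavily as one present in a quarter of them, it is generally the more conservative summary on an imbalanced test set.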