1. Introduction
Marine ecosystems play a critical role in maintaining the balance of our planet's ecosystem by supporting food security and contributing to climate regulation [1], making their preservation essential for the long-term sustainability of the Earth's environment. Thus, there is a growing need to develop and test innovative monitoring systems to ensure the natural preservation of marine habitats. Modern technologies have already shown great potential in monitoring habitats and advancing our understanding of marine communities [2]. Acoustic methods are commonly used for underwater investigations because they detect and enable the classification of sensitive targets even in low-visibility conditions. Passive acoustic monitoring (PAM) technologies, such as underwater microphones, or hydrophones, are particularly attractive because they allow for non-invasive, continuous monitoring of marine ecosystems without interfering with biological processes [3]. PAM has been shown to achieve various research and management goals by effectively detecting animal calls [4]. These objectives may include tracking and localizing animals [5,6], species identification, identifying individuals [3,7], analyzing distributions and behavior [8], and estimating animal density [9].
The bottlenose dolphin (Tursiops truncatus) is a highly intelligent marine mammal and a critical species for researchers studying marine ecosystems [10]. Like many other marine mammals, dolphins are acoustic specialists that rely on sound for communication, reproduction, foraging, and navigation. Their acoustic communication employs a wide range of vocalizations, including clicks, burst-pulses, buzzes, and whistles [11]. Whistles, in particular, serve various social functions, including individual identification, group cohesion, and the coordination of activities such as feeding, resting, socializing, and navigation [12]. Understanding and accurately detecting dolphin vocalizations is essential for monitoring their populations and assessing their role within marine ecosystems.
Traditional bioacoustics tools and algorithms for detecting dolphins have relied on spectrogram analysis, manual signal processing, and statistical methods [13]. For example, the reference approach pursued in [14] applies three noise removal algorithms to the spectrogram of a sound sample. Then a connected region search is conducted to link together sections of the spectrogram that are above a predetermined threshold and close in time and frequency. A similar technique exploits a probabilistic Hough transform algorithm to detect ridges similar to thick line segments, which are then adjusted to the geometry of the potential whistles in the image via an active contour algorithm [15]. Other algorithmic methods aim to quantify the variation in complexity (randomness) occurring in the acoustic time series containing the vocalization, for example, by measuring signal entropy [16]. While these techniques have helped study dolphin vocalizations, they can be time-consuming and may not always provide accurate results due to the complexity and variability of the signals. Researchers have thus turned to machine learning methods to improve detection accuracy and efficiency.
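To make the traditional approach concrete, the sketch below shows a minimal thresholded connected-region search over a spectrogram. It is an illustrative example only: the library choices (NumPy/SciPy), the STFT parameters, the energy threshold, and the minimum region size are our own assumptions, not the exact settings of the cited methods.

```python
import numpy as np
from scipy.ndimage import label
from scipy.signal import spectrogram

def find_whistle_candidates(y, sr, threshold_db=-40.0, min_bins=50):
    """Return time-frequency regions whose energy exceeds a fixed threshold."""
    # Magnitude spectrogram in dB, normalized to its peak.
    f, t, sxx = spectrogram(y, fs=sr, nperseg=1024, noverlap=768)
    sxx_db = 10.0 * np.log10(sxx + 1e-12)
    sxx_db -= sxx_db.max()

    # Binary mask of bins above the threshold; 8-connected labeling links
    # bins that are adjacent in time and frequency.
    mask = sxx_db > threshold_db
    labels, n_regions = label(mask, structure=np.ones((3, 3)))

    candidates = []
    for region_id in range(1, n_regions + 1):
        freq_idx, time_idx = np.where(labels == region_id)
        if freq_idx.size < min_bins:  # discard small, noise-like blobs
            continue
        candidates.append({
            "t_start": t[time_idx.min()], "t_end": t[time_idx.max()],
            "f_low": f[freq_idx.min()], "f_high": f[freq_idx.max()],
        })
    return candidates
```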
Early machine learning studies in the field of dolphin detection applied traditional classifiers, such as Hidden Markov Models (HMMs) [17] and Support Vector Machines (SVMs) [18]. For instance, in [19], a hidden Markov model was utilized for whistle classification; in [20], classification and regression tree analysis was employed along with discriminant function analysis for categorizing parameters extracted from whistles; in [21], a multilayer perceptron classifier was implemented for classifying short-time Fourier transforms (STFTs) and wavelet transform coefficient energies of whistles; lastly, in [15], a random forest algorithm and a support vector machine were combined to classify features derived from the duration, frequency, and cepstrum domains of whistles (see [22] for a review of the early literature).
More recently, researchers have employed deep learning methods to detect whistle vocalizations. Deep neural networks have demonstrated great potential in sound detection generally [23] and underwater acoustic monitoring specifically [24]. The Convolutional Neural Network (CNN) is one of the best-known deep learners. Though commonly considered an image classifier, CNNs have been applied to whale vocalizations, significantly reducing the false-positive rates compared to traditional algorithms while at the same time enhancing call detection [25,26]. In [27], the authors compared four traditional methods for detecting dolphin echolocation clicks with six CNN architectures, demonstrating the superiority of the CNNs. In [28], CNNs were shown to outperform human experts in dolphin call detection accuracy. CNNs have also been applied to automatically categorize dolphin whistles into distinct groups, as in [29], and to extract whistle contours either by leveraging peak tracking algorithms [30] or by training CNN-based models for semantic segmentation [31].
Several studies for dolphin whistle classification have used data augmentation on the training set to enhance the performance of CNNs by reducing overfitting and increasing the size and variability of the available datasets [29,30,32]. Dolphin vocalizations are complex and highly variable, as analyzed in [33]. Unsurprisingly, some traditional music data augmentation methods, such as pitch shifting, time stretching, and adding background noise, have proven effective at this classification task. When synthesizing dolphin calls, care should be taken to apply augmentations to the audio signal rather than to the spectrograms, since altering the spectrogram could distort the time-frequency patterns of dolphin whistles, which would compromise the semantic integrity of the labels [29,34]. In [29], primitive shapes were injected into the audio signal to generate realistic ambient sounds in negative samples, and classical computer vision methods were used to create synthetic time-frequency whistles, which replaced the training data. Generative Adversarial Networks (GANs) have also been employed to generate synthetic dolphin vocalizations [32]. This research underscores the efficacy of data augmentation and synthesis methods in enhancing both the precision and stability of dolphin whistle categorization models, especially in situations where the datasets are limited or imbalanced.
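As a concrete illustration of waveform-level augmentation, the sketch below applies pitch shifting, time stretching, and additive background noise before the spectrogram is computed. The library choice (librosa) and the parameter ranges are assumptions made for illustration, not the settings used in the cited studies.

```python
import numpy as np
import librosa

def augment_waveform(y, sr, rng=None):
    """Apply pitch shift, time stretch, and additive noise to a mono waveform."""
    if rng is None:
        rng = np.random.default_rng()
    # Pitch shift by a random number of semitones.
    y_aug = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0))
    # Time stretch by a random rate around 1.0.
    y_aug = librosa.effects.time_stretch(y_aug, rate=rng.uniform(0.9, 1.1))
    # Add low-amplitude white noise scaled to the signal's peak.
    noise = rng.normal(0.0, 1.0, size=y_aug.shape)
    y_aug = y_aug + 0.005 * np.abs(y_aug).max() * noise
    return y_aug
```

Augmenting the waveform rather than the spectrogram keeps the time-frequency structure of the whistle physically consistent, which is the point stressed above.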
The goal of this work is to continue exploring data augmentation techniques for the task of dolphin vocalization detection. Towards this end, we use the benchmark dolphin whistle dataset developed by Korkmaz et al. [28] but apply data augmentation on the original test set of spectrograms to enlarge it rather than on the training set. The training set contains all spectrograms obtained from audio files recorded between June 24th and June 30th, while the test set is composed of the spectrograms of audio files recorded between July 13th and July 15th, a three-day window. Aside from augmenting the test set, we extract a three-day window (June 24th to June 26th) from the training set as the validation set.
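This split can be expressed directly in terms of the recording dates. The sketch below assumes a hypothetical file-naming convention in which each spectrogram's file stem encodes the recording month and day; the benchmark's actual file layout may differ.

```python
from pathlib import Path

def split_by_date(spectrogram_dir):
    """Partition spectrograms into train/validation/test sets by recording date."""
    train, val, test = [], [], []
    for path in sorted(Path(spectrogram_dir).glob("*.png")):
        # Hypothetical stem format: YYYYMMDD_<index>; only month and day are used.
        month, day = int(path.stem[4:6]), int(path.stem[6:8])
        if (7, 13) <= (month, day) <= (7, 15):
            test.append(path)    # July 13th-15th: test set
        elif (6, 24) <= (month, day) <= (6, 26):
            val.append(path)     # June 24th-26th: validation set
        elif (6, 24) <= (month, day) <= (6, 30):
            train.append(path)   # remaining days of June 24th-30th: training set
    return train, val, test
```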
The proposed system outperforms previous state-of-the-art methods on the same dataset using the same testing protocol. We find our results interesting, especially since many misclassified audio samples are unclassifiable even by humans, so the classification result of our method is probably very close to maximum performance (AUC = 1 is not obtainable). The main contribution of this study is the creation of a new baseline on this benchmark, along with a clear and repeatable criterion for testing various new developments in machine learning.
3. Experimental Results
The protocol used in our experiments mirrored that proposed in [28]. However, we used the validation set described in Section 2.3.2 to learn which data augmentation methods to apply and the weights of the weighted sum rule. After choosing the weights with the validation set, we used the subdivision into training and testing sets described in [28]. We wish to stress that the validation set has been extracted from the training set, so there is no overfitting on the test set. We gauged the performance of the model on the distinct test set by calculating the same performance indicators used in [28]. The True Positive rate (TPR) and the False Positive rate (FPR), together with Precision and Recall, are used to generate the Receiver Operating Characteristic (ROC) curves and to evaluate the corresponding Area Under the Curve (AUC):

$$\mathrm{TPR} = \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

where TP indicates True Positives, TN True Negatives, FP False Positives, and FN False Negatives.
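For concreteness, these indicators can be computed from the per-spectrogram whistle scores as in the sketch below; the use of scikit-learn and the 0.5 decision threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, auc, precision_score, recall_score, roc_curve

def evaluate(y_true, scores, threshold=0.5):
    """Compute accuracy, precision, recall, and AUC from whistle scores in [0, 1]."""
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    fpr, tpr, _ = roc_curve(y_true, scores)   # ROC curve from the raw scores
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),   # equals the True Positive rate
        "auc": auc(fpr, tpr),
    }
```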
In Table 1, we compare a baseline ResNet50 with the proposed data-augmented ResNet50 (named ResNet50_DA) while increasing the size of the ensemble; ResNet50(x)_DA indicates the combination by sum rule of x ResNet50_DA networks.
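A minimal sketch of the sum-rule fusion is shown below; PyTorch is assumed here purely for illustration, and the two-class output (noise vs. whistle) follows the task definition above.

```python
import torch

def sum_rule(models, spectrograms):
    """Fuse x trained networks by summing their softmax scores per spectrogram."""
    # spectrograms: tensor of shape (batch, channels, height, width)
    scores = torch.zeros(spectrograms.shape[0], 2)   # two classes: noise / whistle
    with torch.no_grad():
        for model in models:
            model.eval()
            scores += torch.softmax(model(spectrograms), dim=1)
    return scores.argmax(dim=1)   # predicted class for each spectrogram
```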
We acknowledge that the performance increase recorded in Table 1 may not seem high compared to the baseline. However, we find our results interesting because many of the misclassified samples are unclassifiable even by humans, so we are probably already very close to the maximum performance (AUC = 1 is not obtainable). Furthermore, our results create a new baseline on an available dataset that can be reused for testing other methods.
Figure 4 reports the ROC curves for ResNet50(1) and ResNet50(10)_DA; the plot clearly shows that our proposed approach outperforms ResNet50(1).
In Table 2, we compare our proposed method with two other approaches using the same dataset with the same testing protocol, reporting a full set of performance indicators (accuracy, AUC, precision, and recall). Clearly, the proposed ensemble performs better than the methods reported in the literature, although with higher computational costs. We do not consider this a serious problem: the current computing power of GPUs and the developments expected in the coming years will make such costs increasingly negligible. For example, using a NVIDIA 1080, we were able to classify a batch of 100 spectrograms in ~0.3 seconds (considering a stand-alone ResNet50); with a TitanRTX, the same batch was classified in ~0.195 seconds.
From the ROC curves in Figure 4, it is also worth noting that the ensemble reaches a True Positive rate of 0.9 at a False Positive rate of only 0.02.
Finally, we report the confusion matrix obtained by our proposed ensemble alongside that of the previous baseline on the same dataset. This test also shows the reliability of the proposed method, which reduces the number of false noise and false whistle classifications with respect to the previous baseline.
4. Conclusions
The surge in human activities in marine environments has led to an influx of boats and ships that emit powerful acoustic signals, often impacting areas larger than 20 square kilometers. The underwater noise from larger vessels can surpass 100 PSI, disturbing marine mammals' hearing, navigation, and foraging abilities, particularly for coastal dolphins [38,39]. Therefore, the monitoring and preservation of marine ecosystems and wildlife become paramount.
However, conventional monitoring technologies depend on detection methods that are less than ideal, thereby hindering our capacity to carry out extensive, long-term surveys. While automatic detection methods could significantly enhance our survey capabilities, their performance is typically subpar amidst high background noise levels. In this paper, we illustrated how deep learning techniques involving data augmentation can identify dolphin whistles with remarkable accuracy, positioning them as a promising candidate for standardizing the automatic processing of underwater acoustic signals.
Despite the need for additional research to confirm the efficacy of such techniques across various marine environments and animal species, we are confident that deep learning will pave the way for developing and deploying economically feasible monitoring platforms. We hope our new baseline will further the comparison of future deep learning techniques in this area.
Finally, we should stress the main limitation of using this dataset as a benchmark: the training and test sets come from the same region (Dolphin's Reef in Eilat, Israel), and all samples were collected using the same acoustic recorder.