4.1. Benchmark Comparison
Figure 3 presents a comparison of the mean average precision (mAP) achieved by different methods on the Open-MIC dataset. The x-axis lists the methods, while the y-axis indicates the mAP values. The graph includes benchmark methods and our proposed methods, and it highlights our best model. The benchmark methods serve as a reference point, showing the progression in performance over the years, and include the Baseline [23], PaSST [24], EAsT-KD + PaSST [25], and DyMN-L [26]. Our methods comprise several configurations that combine spectrogram features and attention mechanisms. Specifically, the "Single Log-Mel," "Log-Mel CST Combined Spectrogram," and "Log-Mel CST with Attention Layer" configurations explore the impact of different spectrogram features on model performance. Additionally, we experimented with magnifying the spectrogram features to different extents: "Magnified 1/4 Size," "Magnified 1/2 Size," "Magnified 3/4 Size," and "Magnified Full Size."
Among these, the "Magnified 1/4 Size" model achieved an mAP of 0.8445, close to that of the leading benchmark methods. This result suggests that carefully scaling the spectrogram features and incorporating attention mechanisms helps the model focus on the most informative parts of the input data. The magenta marker and horizontal magenta line on the graph mark this model's performance, illustrating the potential effectiveness of our approach in musical instrument recognition tasks.
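As a reminder of how the mAP values in Figure 3 are defined, the following minimal sketch computes non-interpolated average precision per instrument class and averages it over classes. The function names and the toy labels and scores are hypothetical and are not taken from our experiments; the sketch also assumes fully labeled clips, whereas any masking of unlabeled clip-instrument pairs would need to be applied before calling it.

```python
def average_precision(labels, scores):
    """Non-interpolated AP for one class: mean precision at the rank of each positive."""
    # Rank clips from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive clips contributes 0.0 in this sketch (an assumption).
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(labels_by_class, scores_by_class):
    """mAP: unweighted mean of the per-class average precision."""
    aps = [
        average_precision(labels_by_class[c], scores_by_class[c])
        for c in labels_by_class
    ]
    return sum(aps) / len(aps)


# Toy example with two hypothetical classes (not real Open-MIC results).
toy_labels = {"piano": [1, 0, 1, 0], "flute": [0, 1]}
toy_scores = {"piano": [0.9, 0.8, 0.7, 0.1], "flute": [0.2, 0.6]}
print(round(mean_average_precision(toy_labels, toy_scores), 4))
```

In the toy example, the piano class has positives at ranks 1 and 3, giving an AP of (1/1 + 2/3) / 2, and the flute class ranks its only positive first, giving an AP of 1.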
4.2. Evaluation Metrics Comparison among Scaled Multi-Spectrogram Settings
Figure 4 presents a comprehensive comparison of the precision, recall, and F1-score metrics for various instrument recognition models using different configurations of spectrogram features. The x-axis represents the different musical instruments, while the y-axis indicates the metric values.
The models compared include configurations such as Log-Mel 128 with Original CST Sizes (Chroma = 12, Spectral Contrast = 7, Tonnetz = 6), Log-Mel 128 with CST Magnified to 1/4 Size (Chroma = 32, Spectral Contrast = 32, Tonnetz = 32), Log-Mel 128 with CST Magnified to 1/2 Size (Chroma = 64, Spectral Contrast = 64, Tonnetz = 64), Log-Mel 128 with CST Magnified to 3/4 Size (Chroma = 96, Spectral Contrast = 96, Tonnetz = 96), and Log-Mel 128 with CST Magnified to Full Size (Chroma = 128, Spectral Contrast = 128, Tonnetz = 128). Each sub-plot within the figure illustrates a specific metric comparison, with the top plot showing Precision Comparison, the middle plot showing Recall Comparison, and the bottom plot showing F1-Score Comparison.
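To make the feature sizes in these configurations concrete, the sketch below magnifies each CST feature along its row (frequency-like) axis and stacks it under a 128-band Log-Mel spectrogram. Two points are assumptions made only for illustration: the resizing uses nearest-neighbour row repetition, and the features are concatenated along the row axis. The helper names and the dummy matrices are likewise hypothetical.

```python
def magnify_rows(feature, target_rows):
    """Nearest-neighbour resize of a 2-D feature (rows x frames) along the row axis."""
    src_rows = len(feature)
    return [list(feature[(r * src_rows) // target_rows]) for r in range(target_rows)]


def stack_features(log_mel, chroma, contrast, tonnetz, cst_rows=None):
    """Stack Log-Mel with the CST features, optionally magnifying each CST feature."""
    parts = [chroma, contrast, tonnetz]
    if cst_rows is not None:
        parts = [magnify_rows(p, cst_rows) for p in parts]
    stacked = [list(row) for row in log_mel]
    for part in parts:
        stacked.extend(part)
    return stacked


def dummy(rows, frames=4):
    """Placeholder feature matrix; real features would come from an audio front end."""
    return [[float(r)] * frames for r in range(rows)]


log_mel, chroma, contrast, tonnetz = dummy(128), dummy(12), dummy(7), dummy(6)

# Total number of rows for each configuration described above.
for name, rows in [("original", None), ("1/4", 32), ("1/2", 64), ("3/4", 96), ("full", 128)]:
    combined = stack_features(log_mel, chroma, contrast, tonnetz, cst_rows=rows)
    print(name, len(combined))  # 153, 224, 320, 416, 512 rows respectively
```

Under these assumptions, the original CST sizes add 12 + 7 + 6 = 25 rows to the 128 Log-Mel bands, while the full-size setting gives each CST feature the same height as the Log-Mel spectrogram.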
According to Figure 4, for instruments like accordion, banjo, bass, drums, guitar, marimba, piano, synthesizer, and trumpet, high precision is consistently maintained across all configurations, indicating effective differentiation with minimal false positives. However, for cello, clarinet, flute, mandolin, violin, and voice, precision varies, suggesting certain spectrogram features better reduce false positives. Notably, the "Log-Mel 128 CST Magnified 1/4 Size" configuration generally provides a balanced performance, capturing essential characteristics effectively. Instruments like cymbals, organ, saxophone, and trombone exhibit significant precision fluctuations, indicating overlapping features with other instruments, making accurate differentiation more challenging.
Recall is consistently high for accordion, banjo, bass, piano, synthesizer, trumpet, ukulele, violin, and voice, showing that the model detects these instruments reliably. Recall varies for cello, clarinet, cymbals, flute, guitar, marimba, mandolin, and saxophone; the "Log-Mel 128 CST Magnified Full Size" configuration often achieves higher recall for these instruments, suggesting that larger feature sizes capture more of their relevant characteristics. Lower and more variable recall for drums, organ, and trombone indicates that these instruments are less distinct or harder to detect accurately.
F1-scores are high and consistent for accordion, banjo, bass, drums, guitar, piano, synthesizer, and trumpet, reflecting a good balance between precision and recall. In contrast, cello, clarinet, cymbals, flute, mandolin, organ, saxophone, and trombone show fluctuating F1-scores, with the "Log-Mel 128 CST Magnified 1/4 Size" and "Log-Mel 128 CST Magnified Full Size" configurations often performing better, which indicates that these configurations provide a better trade-off between detecting instruments and minimizing false predictions. Voice, violin, marimba, and ukulele also show variable F1-scores, suggesting room for optimization in feature size and attention mechanisms.
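For reference, the per-instrument precision, recall, and F1-score discussed above follow the standard definitions, sketched below for a single instrument. The function name and the toy labels are hypothetical, and the sketch assumes the model outputs have already been binarized at some decision threshold, which is not specified here.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for one instrument from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Each metric falls back to 0.0 when its denominator is zero (a convention, not a rule).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy example: 2 true positives, 1 false positive, 2 false negatives.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(round(p, 4), round(r, 4), round(f, 4))
```

Here precision is 2/3 and recall is 1/2, so the F1-score of 4/7 sits between them and closer to the lower value. This is why an instrument with high precision but variable recall can still show a fluctuating F1-score.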