I. Introduction
Speech is the most basic form of human communication and is essential in comprehending behavior and cognition [1]. Humans create speech with the help of the vocal system, which consists of the vocal folds (larynx), the lungs, and the articulation system, which includes the lips, cheeks, palate, tongue, and so on [2]. When air is expelled from the lungs and travels through the windpipe and vocal folds, it causes the vocal cords to vibrate, resulting in sound. The sound is shaped into recognizable words by the muscles controlling the soft palate, tongue, and lips [3,4]. The created speech is sensed by the ear of the human auditory system and processed by the brain to produce a response, action, or emotion [2]. The human ear can respond to audio frequencies ranging from 20 Hz to 20 kHz, whereas the human voice frequency range is 300-3400 Hz. As a result, humans can only recognize frequencies below 4 kHz and rarely above 7.8 kHz. Thus, following the Nyquist sampling criterion (Fs ≥ 2·Fvoice,max), the necessary level of audio quality is sampled at 8 kHz, and a higher level of audio quality is sampled at 16 kHz.
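For illustration, the sketch below resamples a recording to the 8 kHz rate implied by the Nyquist criterion above; it assumes the librosa and soundfile packages are available, and the file names are hypothetical.

```python
# Minimal sketch (not the pipeline itself): downsample a recording so that the
# sampling rate satisfies the Nyquist criterion Fs >= 2 * Fvoice_max.
import librosa
import soundfile as sf

F_VOICE_MAX = 4000            # approximate upper bound of speech content (Hz)
TARGET_FS = 2 * F_VOICE_MAX   # 8 kHz, the "necessary" audio quality level above

# Load at the native rate (sr=None), then resample to the target rate.
signal, native_fs = librosa.load("recording.wav", sr=None)
resampled = librosa.resample(signal, orig_sr=native_fs, target_sr=TARGET_FS)
sf.write("recording_8khz.wav", resampled, TARGET_FS)
```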
A speech signal contains several levels of information besides linguistic content; it conveys information about a speaker's identity, gender, health, and emotional state [5]. Speech processing has vast applications, which can be categorized under automated speaker recognition and speech recognition. Speaker recognition is an important bio-feature recognition method that authenticates or identifies an individual using the specific characteristics obtained from their speech utterances [6,7]. Every individual's voice is different because of biological differences in the size and shape of their vocal cords and vocal tract, as well as behavioral differences [8]. The application and popularity of speaker recognition have increased over time. The first voice recognition system was created by Bell Laboratories in 1952 [9]. In 1956, several computer scientists put forward the concept of artificial intelligence, and speaker recognition began to enter the era of artificial intelligence research [10]. However, due to poor computer hardware capabilities and the immaturity of related algorithms, research on speaker identification did not achieve great results until the 1980s, when machine learning, a powerful branch of artificial intelligence, began to use algorithms to analyze data, obtain relevant feature information from it, and then make decisions and predictions to solve problems [11]. Applications have expanded from personal assistant services in mobile devices to secure access to highly secure areas and machines, such as voice dialing, banking, databases, and computers for authentication and forensics [6,12]. In contrast, speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is the ability of a machine or program to recognize spoken words and convert them into readable text [13,14]. This technology is widely used for virtual assistants (Siri, Alexa, Google Assistant), transcription services, voice-activated devices, and hands-free operation of smartphones and automobiles [14,15].
This emphasizes the importance of preparing and storing voice data in the broader field of recognition research and artificial intelligence. However, storing large amounts of voice data is challenging: it raises privacy concerns, requires considerable storage space, and demands careful management of data quality and effectiveness. It is also important to maintain the quality and format of the voice data so that it remains compatible with various applications. Therefore, a balance must be maintained between privacy, security, and data quality for future use.
Noise can affect the speaker or speech recognition applications of these speech signals. Background noise can obstruct speech comprehension through energetic masking, which occurs when the background noise has energy in the same frequency band as the speech signal, preventing the speech signal from being perceived [16]. Noise is extremely challenging for speech systems to handle and requires various methods for its removal [17]. Several common types of noise degrade the performance of any recognition system. Additive noise refers to background sounds, such as a fan, vacuum cleaner, air conditioner, or a baby crying, that combine with the target speech signal at the microphone level, where their sound waves overlap [17]. Convolutional noise occurs when someone speaks in a closed space, causing sound waves to bounce off walls and create a colored, echoey recording at the microphone, with larger spaces producing more reverberant sounds [17]. Nonlinear distortion occurs when the speaker is too close to the microphone or the gain on the device is set too high [17]. Typically, a noisy environment is more difficult to fix, and not all solutions work for each type of noise interference. The commonly used high-pass filter is not effective for these varying noises.
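As a simple illustration of the additive case, the following NumPy sketch mixes a noise waveform into clean speech at a chosen SNR; the function name and the assumption of equal-length, time-aligned arrays are ours, not part of the recording set-up.

```python
# Illustrative sketch of additive noise: background sound is summed with the
# clean speech waveform, scaled to reach a requested SNR (in dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Return speech + scaled noise so the mixture has the requested SNR."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / (scale^2 * noise_power) = 10^(SNR/10).
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```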
Several quality factors are used to evaluate the performance of a filtering algorithm, such as SNR, Mean Squared Error (MSE), jitter, Total Harmonic Distortion (THD), and others. Most studies calculate the SNR for performance evaluation. Murugendrappa et al. [18] introduced a novel approach for Adaptive Noise Cancellation (ANC) in speech signals affected by Gaussian white noise, utilizing adaptive Kalman filtering. To evaluate the Kalman filter's performance, they calculated the SNR. Notably, the Kalman filter achieved a signal-to-noise ratio (SNR) of around 1.17 dB and an MSE of 0.032, demonstrating its superior effectiveness in noise cancellation compared to other adaptive filters. Goh et al. [19] developed a bidirectional Kalman filter for speech enhancement, utilizing a system dynamics model to estimate the current time state. The study compared this approach with conventional and fast adaptive Kalman filters, assessing performance based on correlation, SNR, WSS, and computation time. Results from testing on the TIDIGIT speech database revealed that the bidirectional Kalman filter improved robustness at low SNR, outperforming other methods in enhancing speech recognition rates when the SNR was at or below 5 dB, despite requiring more iterations to reach a steady state. Zhou et al. examined the influence of noise environments and frequency distributions on voice identity perception, comparing the results for different noise types using SNR. The results indicate that accuracy increases with SNR, and that speech noise affects perception more than white noise and pink noise.
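For reference, the two metrics used most often in these studies can be computed as in the sketch below; it assumes a clean reference signal aligned with the processed output, which is one common way of defining the SNR.

```python
# Sketch of the evaluation metrics referenced above (SNR in dB and MSE),
# assuming `clean` and `processed` are time-aligned NumPy arrays of equal length.
import numpy as np

def snr_db(clean: np.ndarray, processed: np.ndarray) -> float:
    """Ratio of signal power to residual-noise power, in decibels."""
    residual = clean - processed
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))

def mse(clean: np.ndarray, processed: np.ndarray) -> float:
    """Mean squared error between the reference and the processed signal."""
    return float(np.mean((clean - processed) ** 2))
```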
This study aims to develop a robust model that makes voice samples noise resilient for further applications. The pipeline is intended to remove background and unwanted noise without changing the voice characteristics. The pipeline scripts and documentation are also made open-source. Responding to the need to store a reduced template of the large voice samples, the proposed method also extracts the features most commonly used in speech processing, such as Mel-frequency cepstral coefficients (MFCC), pitch, zero crossing rate, and short-time energy, from the filtered audio signal. These features capture various aspects of a speaker's voice, providing rich information that may be sufficient for later speaker and speech recognition.
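The sketch below shows one way these features could be extracted with librosa; the frame and hop sizes, pitch search range, and file name are illustrative assumptions rather than the exact pipeline settings.

```python
# Hedged sketch of the feature template (MFCC, pitch, zero crossing rate,
# short-time energy) extracted from a filtered recording.
import numpy as np
import librosa

signal, fs = librosa.load("filtered_speech.wav", sr=None)
frame_len, hop = 2048, 512

mfcc = librosa.feature.mfcc(y=signal, sr=fs, n_mfcc=13,
                            n_fft=frame_len, hop_length=hop)
zcr = librosa.feature.zero_crossing_rate(signal, frame_length=frame_len,
                                         hop_length=hop)
pitch = librosa.yin(signal, fmin=80, fmax=400, sr=fs,
                    frame_length=frame_len, hop_length=hop)
energy = np.array([np.sum(signal[i:i + frame_len] ** 2)
                   for i in range(0, len(signal) - frame_len + 1, hop)])
```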
The overall contributions of this work can be summarized as follows:
An open-source preprocessing pipeline (including filtering, segmentation, and feature extraction) for voice data preparation and storage for future applications.
Development of the model on a large longitudinal dataset, with validation on a public dataset and a locally collected dataset in different environments using SNR as the performance metric.
Performance comparison of the model with state-of-the-art deep-learning-based methods.
The rest of the paper is organized as follows. Section 2 gives the dataset description. Section 3 describes the voice preprocessing method. Sections 4 and 5 describe the segmentation and feature extraction. Section 6 covers the performance evaluation, and Section 7 presents the results. Section 8 provides the discussion. Lastly, Section 9 concludes this study.
II. Dataset Description
The longitudinal voice dataset used in this study was collected from local elementary, middle, and high school children aged between 4 and 18 years. The dataset consists of 14 collections, shown in Figure 1, starting from summer 2016 with an approximate interval of six months. Collections 9 to 12 were not performed due to COVID-19.
The voice data was collected at a sampling rate of 44.1 kHz using a set-up with one condenser-based and one mono-channel microphone, as shown in Figure 2. The condenser microphone is mostly used for studio recording applications [20]. Condenser microphones are generally built with a lightweight diaphragm (a sensitive conductive material) suspended in front of a fixed back plate [21,22]. When sound waves reach the diaphragm, the sound pressure causes it to vibrate against the back plate, causing the voltage between them to fluctuate. This fluctuation creates an electrical signal that mimics the pattern of the incoming sound waves, and an external power supply boosts the audio signal to produce an amplified sound [23]. The condenser microphone is extremely sensitive and can pick up a wide range of frequencies, which makes it more suitable for quieter environments [20,24].
The mono-channel microphone was added starting from Collection 8. This directional microphone has a single capsule that records sound from only one channel [25]. It is ideally suited to focus on sound coming from one specific source, usually the speaker in front, while disregarding sound coming from other directions, such as the sides and back of the microphone's position [26].
During the collection, shown in Figure 3, the children were shown a series of images, such as numbers from 1 to 10, common objects, and animals. They were instructed to utter the corresponding English words as they viewed the images. At the end, they were shown an image of a circus scene with different activities and were asked to describe it. The approximate duration of each voice recording was 90 seconds, although it varied per subject depending on speaking speed and pauses between words. The recorded dataset, which consists of 1629 audio recordings, has both text-dependent (numbers and object or animal images) and text-independent (circus scene image, the last 10 seconds) parts. Since the data was collected in a school environment, each subject's recordings contain varying noises, such as people talking, walking, and opening or closing doors.
IV. Segmentation
Audio segmentation is a technique that divides audio signals into a sequence of segments or frames, each containing audio information from a speech [36,37,38]. In this study, we use the voice activity detection (VAD) method for segmentation. The VAD method detects the presence or absence of human speech [39]. It can remove insignificant parts of the audio signal, such as silences or background noises, which increases efficiency and improves the recognition rate [40]. The principle of short-time energy-based VAD is that voiced frames have more energy than silent frames. Therefore, by computing the energy of each frame and comparing it against a predefined threshold, we can decide whether the frame is voiced or silent [41,42].
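A minimal sketch of this energy rule is shown below; the frame array and threshold are placeholders, with the actual frame durations and thresholds discussed after the flowchart.

```python
# Minimal sketch of the short-time-energy rule: a frame is labeled as speech
# when its mean energy exceeds a predefined silence threshold.
import numpy as np

def is_speech_frame(frame: np.ndarray, threshold: float) -> bool:
    """Return True if the frame's mean energy is above the threshold."""
    return float(np.mean(frame ** 2)) > threshold
```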
Figure 6 shows the flowchart of the VAD process for segmentation.
Figure 6.
VAD process flowchart.
In our study, VAD is applied to detect speech segments within the filtered audio. The parameters of the VAD algorithm were configured to adapt to the characteristics of the dataset. The filtered audio is divided into frames with durations of 0.1, 0.5, 0.9, and 1 second with 50% overlap. Silence thresholds of 0, 0.0001, 0.001, and 0.002 are used to differentiate speech from silence based on frame energy. Because speech characteristics vary across speakers, a grid search over these parameters is applied. Frames with energy exceeding the threshold are marked as speech segments. A robust VAD algorithm improves the performance of a speaker verification system by ensuring that speaker identity is computed only from speech regions [43].
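The following sketch illustrates this parameter sweep; the frame durations and thresholds match the values listed above, but the framing helper and the selection criterion (fraction of frames labeled as speech) are illustrative placeholders rather than the exact rule used in the pipeline.

```python
# Illustrative grid search over VAD frame durations (50% overlap) and silence
# thresholds. The scoring rule is a placeholder, not the pipeline's exact rule.
import itertools
import numpy as np

FRAME_DURATIONS = [0.1, 0.5, 0.9, 1.0]          # seconds
SILENCE_THRESHOLDS = [0, 0.0001, 0.001, 0.002]  # frame-energy thresholds

def vad_labels(signal: np.ndarray, fs: int, frame_sec: float, threshold: float) -> np.ndarray:
    """Label each 50%-overlapping frame as speech (True) or silence (False)."""
    frame_len = int(frame_sec * fs)
    hop = frame_len // 2
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.mean(f ** 2) > threshold for f in frames])

def grid_search_vad(signal: np.ndarray, fs: int):
    """Try every parameter pair and keep the one with the best placeholder score."""
    best = None
    for frame_sec, thr in itertools.product(FRAME_DURATIONS, SILENCE_THRESHOLDS):
        labels = vad_labels(signal, fs, frame_sec, thr)
        speech_ratio = labels.mean() if labels.size else 0.0
        score = -abs(speech_ratio - 0.6)   # placeholder objective
        if best is None or score > best[0]:
            best = (score, frame_sec, thr, labels)
    return best[1], best[2], best[3]       # chosen duration, threshold, labels
```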
Figure 7 shows the segmentation result.
Figure 7.
Segmented Audio Signal.
Each segment is highlighted in a different color to visually distinguish the segments.
VII. Results
Figure 9 shows the before-and-after filtering results of one subject and the SNR results of the original and filtered audio.
Figure 9.
Original and Filtered audio signal and SNR result of condenser-based microphone.
The original audio's mean SNR was 3.76 dB, but after filtering, it increased to 30.75 dB; the standard deviation also increased from 8.08 to 14.54.
Figure 10 shows the original and filtered audio and the SNR results before and after filtering for the same subject recorded with the mono-channel microphone.
Figure 10.
Original and Filtered audio signal and SNR result of mono-channel microphone.
The mono-channel microphone captures less noise, so the mean SNR of the original audio, 10.49 dB, is higher than that of the condenser-based microphone. After filtering, the remaining noise was removed, and the output SNR increased to 46.35 dB.
Figure 11 shows the result of the high-pass filter at 150 Hz cut-off frequency.
Figure 11.
Result of HPF for a cut-off frequency of 150 Hz.
After filtering, the mean SNR result decreased for each cut-off frequency. The mean SNR was 3.36 dB for the 150 Hz cut-off frequency.
Figure 12 shows the filtering result of a locally collected dataset.
Figure 12.
Original and Filtered audio signal and SNR result of local data.
The local dataset consists of varying noises, as it was collected in a real-world environment. The result shows that after filtering, the unwanted background noises are removed, and the SNR improves to 17.26 dB.
Figure 13 shows the performance of our algorithm using a public dataset.
Figure 13.
Original and Filtered audio signal and SNR result of public dataset.
The improved mean SNR of 2.66 dB suggests that the filtering algorithm also performs well on other datasets.
Figure 14 shows the result of the deep learning Wave-U-Net model.
Figure 14.
Original and filtered audio signal and the SNR result of Wave-U-Net.
After filtering with the deep learning model, the mean SNR is 12.73 dB, which is lower than the result of our proposed model.
In parallel, we ran a speaker recognition algorithm on the dataset before and after filtering. We used the VeriSpeak algorithm for speaker recognition and ran an N:N test across the dataset. The result is shown in Table 1.
VIII. Discussion
The filtering and SNR results in Figure 9 and Figure 10 indicate that our algorithm works for the audio signals from both microphones. After filtering, the background noise is removed while the speech part of the audio is kept, and the original and filtered audio lengths are the same. The mean SNR value of the original signal was 3.76 dB, whereas for the filtered audio it was 30.75 dB, meaning the filtered audio signal has a higher level of signal power relative to the noise power. In short, the filter applied to the audio signal improved the signal quality by reducing noise and enhancing the desired signal components.
Following the segmentation process in Figure 7, the pauses were removed, retaining only the speech components. This effectively removes any remaining noise while also enhancing the audio signal. From this speech signal, we extracted features that are stored instead of the original signal. This approach uses the least amount of space while optimizing storage. To ensure effective access and further analysis, the stored features act as a representative and compressed form of the speech signal. This method not only preserves important speech characteristics but also improves data storage efficiency, which increases our algorithm's overall effectiveness and usefulness.
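A small sketch of this storage step is given below; the file name and the placeholder feature arrays are assumptions that stand in for the features extracted earlier.

```python
# Illustrative sketch: save a compact feature template instead of the raw
# waveform, and reload it later for recognition experiments.
import numpy as np

# Placeholders standing in for the extracted features (MFCC matrix, per-frame
# pitch, zero crossing rate, and short-time energy).
mfcc = np.zeros((13, 300))
pitch = np.zeros(300)
zcr = np.zeros(300)
energy = np.zeros(300)

np.savez_compressed("subject_template.npz",
                    mfcc=mfcc, pitch=pitch, zcr=zcr, energy=energy)

template = np.load("subject_template.npz")   # access e.g. template["mfcc"]
```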
The input from the mono-channel microphone in Figure 10 shows the presence of some noise. The mean SNR of the original audio was 10.49 dB; after filtering, it increased to 46.35 dB. Because the mono-channel microphone captures less noise, the original and the filtered signals are almost the same. The few noise components that were captured by the microphone were removed after filtering.
With the traditional HPF in Figure 11, the filter only passes frequencies above the chosen cut-off frequency, so there is a trade-off between noise reduction and loss of important information. Because the HPF only removes the lower frequencies, higher-frequency noise remains in the signal, and in some cases the original audio is lost if the cut-off frequency is set too high. In contrast, with our algorithm there is less chance of losing any important characteristics of the audio recordings.
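For comparison, the high-pass baseline can be reproduced with a standard Butterworth design as sketched below; the filter order and the zero-phase filtering choice are our assumptions, not the paper's exact configuration.

```python
# Sketch of the 150 Hz high-pass baseline using SciPy; order and zero-phase
# filtering (filtfilt) are assumptions for illustration.
import numpy as np
from scipy.signal import butter, filtfilt

def highpass(signal: np.ndarray, fs: int, cutoff_hz: float = 150.0, order: int = 5) -> np.ndarray:
    """Apply a zero-phase Butterworth high-pass filter at the given cut-off."""
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="highpass")
    return filtfilt(b, a, signal)
```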
We tested the performance on a locally collected dataset from the Clarkson University cafeteria, which contained varying audio noises (Figure 12), as well as on a publicly available dataset (SpEAR), where noise was added to clean audio signals at different decibel levels. In Figure 13, the subject data contains pink noise added at 16 dB, which was removed after filtering. The SNR results show the improvement: the mean SNR of the noisy recording was 0.48 dB, and after filtering it improved to 2.66 dB. This indicates that our algorithm is effective over a wide range of datasets. Across each dataset, our algorithm consistently demonstrates an enhancement in SNR after filtering, highlighting its robust performance across diverse audio environments.
In evaluating the efficiency of our method compared to the deep learning model, it is notable that our approach performs well, significantly improving the SNR after filtering. The adaptability of our algorithm further demonstrates its effectiveness across diverse datasets. These results underscore the algorithm's efficacy in addressing real-world challenges in varying noisy environments, positioning it as a valuable contribution to noise reduction techniques.
Although our denoising algorithm performs well, some limitations need to be addressed. As the data was collected in a school setting, some background noise lies in the same frequency range as the speaker, and this kind of noise is hard to remove. To address this issue in the future, we need to design more advanced filters for multi-dominant denoising. In addition, we will explore the application of deep learning methods for further denoising.
Because the noise varies across subjects, the preprocessing parameters cannot be fixed, which increases the computational time. To address this, we will explore machine learning clustering techniques to group subjects by noise profile, as well as real-time monitoring systems for dynamic parameter selection.
Furthermore, we need to expand the feature extraction approach. Expanding the feature set will increase adaptability for future applications and may improve the dataset's usefulness by including a wider range of attributes, providing a more complete and flexible tool for later research and use.